PREFACE


This book has been written primarily for prospective teachers 
who want to know how mental tests can be of help in their school 
work. It can serve also as a guide for teachers in service. The first
seven chapters describe the varieties of mental tests and point out 
the usefulness and limitations of each sort. The last three chapters 
deal with the writing of objective items, with the construction 
of classroom tests, and with some of the ways in which mental 
tests can usefully be employed in guidance and counselling. 
Chapter 2 covers in summary fashion the statistical terms and 
procedures most often used with mental tests. I do not believe it 
possible to describe mental tests intelligently without using rele- 
vant statistical terms. At the same time, I think that the classroom 
teacher need not be a psychometrician or testing specialist in 
order to use standard tests in the school. For those who want to 
go further into test construction, there is an Appendix which 
treats statistical method more fully. 

I have found it generally better to teach Chapter 2 before 
taking up a discussion of mental tests themselves—to use it, that 
is, as a preliminary to later chapters. Chapter 2 can then be re- 
ferred to specifically when the various statistical terms occur. 
This procedure has the advantage of reviewing the basic statis- 
tics when the need arises. 

I believe that the book will be found to contain ample material 
for one term’s work. This is especially true when the laboratory 
exercises and questions at the ends of the chapters are covered in 
class discussion, and when reports upon relevant literature are 
required. 


Henry E. Garrett


CHAPTER 1


MENTAL TESTS IN THE SCHOOLS 


The Teacher and Mental Tests 


The widespread use of standard tests in today’s schools ren- 
ders it increasingly necessary for the classroom teacher to be 
familiar with these devices, with what they are and what they do. 
Teachers are often required to administer and score tests and 
frequently to use these scores in the evaluation of pupil capa- 
bilities and future promise. This is essential, of course, if the 
standard test is to have value in the work of the school. Most 
teachers, however, have no desire to become testing specialists 
or psychometricians, and many have little knowledge of modern 
statistical method. For these reasons, books dealing chiefly with
the statistics of test construction and with other technical problems,
while a necessary part of the training of school and clinical
psychologists, are often not very useful to the teacher. In fact, 
they may leave him more confused than enlightened. 

This book is planned to present a comprehensive account of
standard tests for teachers and for others not planning to become 
specialists in this field. It is not a book on statistical method, it 
does not deal broadly with the history of testing, nor with the 
applications of tests to problems of business and industry. Instead, 
it describes the various sorts of test, their uses and abuses, and how 
they supplement and aid the work of the classroom. Statistical 
terms necessary to an understanding of the tests themselves are 
defined and illustrated, but detailed calculations are not included 
in the text. The book’s usefulness will be enhanced if the exer- 
cises and topics at the ends of the chapters are carefully worked 
through. It is highly desirable, too, that the instructor have the 
class examine, take and score a number of tests. The discussion
in a chapter will be clarified when there is actual familiarity with
the tests described.


What Mental Tests Are 


In a mental test, the examinee is confronted with a variety of 
tasks—questions to be answered, problems to be solved, direc- 
tions to be followed. Answers may be given orally, in writing, 
and sometimes by marking or manual manipulation, as, for 
example, by fitting blocks into apertures. Mental tests differ from 
physical tests, though there is considerable overlap in the two 
sorts of measurement. Both varieties of test require previous 
learning, and both present problems, but the mental test—to a 
greater degree than the physical—demands verbal abstraction 
rather than action, ideas rather than muscles. Tests of physical 
fitness—of height, weight, and physical strength, for example— 
differ most markedly from mental tests; in other words, are most 
physical. Tests which require speed and accuracy of hand-eye 
or hand-ear co-ordination, which demand manual dexterity and 
skill (called sensory-motor tests) are both mental and physical.
But none of these tests is as “mental” as is the intelligence test
or school examination in algebra or history, since none of them 
depends to so large a degree upon verbal symbols. 

The term mental test is sometimes restricted to the measurement
of intelligence or aptitude, examinations in school subjects
being classified as educational achievement tests. The reasoning
here is that the mental test (the intelligence test, for example)
tells us how much a child can learn, whereas the school examination
tells us what he has already learned. To some extent this is
true. But the distinction between the two sorts of measurement
is one of degree rather than of kind. No mental test measures
potential ability except by way of performance. We possess no 
microscope by which we can discover the inherited qualities of 
a child's brain or nervous system. The general intelligence test, 
to a greater degree than the school examination, measures poten- 
tial ability because it draws more upon native alertness than upon 
routine school learning. But the school examination also draws 
upon native alertness as expressed in school learning, and both 
sorts of test demand the use of symbols—words, diagrams, 
numbers, pictures. Accordingly, in this book the term mental test
will be used to describe both sorts of examination. 

The primary objective of a mental test is to detect individual 
differences—that is, to discover how one child compares or 
“stacks up” against another child of the same age, sex or grade 
classification. This knowledge, as we shall see later, is useful in
many ways in school and out. A second objective of the mental
test is to discover intra-individual differences or the variations
in performance within an individual. The scores made by an
examinee, when put in comparable units and represented on a
profile, provide a useful record of the examinee’s strengths and
weaknesses.


A Classification of Mental Tests


In beginning the study of mental tests, it will be helpful to
draw up a list of the different varieties of tests. Most widely used
tests are standardized for procedure and results. A standardized
educational achievement test, for example, is one that has been 
constructed in accordance with the best principles of test making 
and has been administered to hundreds of pupils in those grades 
for which the test is suitable. Results from standard tests are 
expressed as norms. These are typical scores earned by large 
groups of children believed to be representative of various ages 
and grades. For example, a score of 45 on a standard reading test 
may be the norm for children 9 years, 6 months old; or for chil- 
dren who are just beginning the fourth grade. 

The following outline gives some notion of the field to be 
covered and at the same time furnishes an overview of the
chapters to follow. 

VARIETIES OF MENTAL TESTS

I. Intelligence Tests

(1) individual: administered to one examinee at a time

(2) group: administered, like a school examination, to many
examinees at the same time

(3) performance: make little or no use of language, in contrast
with the paper-and-pencil tests in (1) and (2)

II. Educational Achievement Tests

(1) survey: comprehensive examinations used to determine
general academic standing

(2) subject: examinations in specific fields, for example,
physics, Spanish

(3) diagnostic: cover a wide range of academic skills (in
reading or arithmetic, for example) and are designed to
reveal specific weaknesses and strengths

III. Aptitude Tests

(1) general: for example, of
(a) mechanical ability
(b) clerical ability

(2) special: aptitude for school subjects, for example, chemistry
or foreign languages; differential aptitudes

(3) professional: for example, in
(a) law
(b) medicine
(c) engineering
(d) teaching

(4) talent: aptitude in such fields as
(a) art
(b) music

IV. Tests of Various Aspects of Personality 

(1) personal adjustment questionnaires: surveys of worries, 
fears, social inadequacies 

(2) attitude surveys: upon, for example, social, economic 
and political questions 

(3) inventories of interests as related to various occupations 

(4) environmental factors related to personality: question- 
naires covering socio-economic background and other 
variables 

(5) projective techniques: subtle and indirect measures of 
dominant personality trends 


All these mental tests will be treated in subsequent chapters. 
The following sections of this chapter provide a brief outline of 
the development of psychological testing in order to clear the 
ground for later work. For a more complete discussion of the 
historical development of mental tests, the student should consult 
references at the end of this chapter. 


The Beginnings of Mental Tests 


Interest in psychological testing developed in Germany and 
France about the middle of the last century. This interest grew 
out of the acute need for a better understanding of feebleminded- 
ness and the various forms of insanity. Tests were devised for 
the purpose of determining what the feeble-minded person can 
learn, how much he can learn, and in what respects he differs 
most drastically from the normal. In the case of the insane and 
the mentally deteriorated, brief tests were drawn up for assessing 
loss of memory, distortions of perception, distractibility, mental 
fatigue, and changes in such sensory-motor functions as speed 
and accuracy of motor responses. 

In England, interest in mental testing arose from the study of 
individual differences in mental and physical functions. The
leader in this movement was Sir Francis Galton, an eminent
geneticist, who set up a testing laboratory in London in 1882.
Here, for a small fee, a person could have the keenness of his 
vision and hearing tested, as well as his muscular strength and 
his speed and co-ordination of response. Galton’s tests were quite 
brief and sampled rather narrow aspects of behavior. In fact, 
they were sensory-motor rather than strictly mental in character.
One of the first American psychologists to become interested
in mental testing was James McKeen Cattell. Cattell introduced
mental tests of the Galton type in this country at the
turn of the century.


Intelligence Tests: Individual 


The individual intelligence test as we know it today grew out 
of the work of Alfred Binet, a French psychologist, who was 
director of the laboratory for physiological psychology at the 
Sorbonne. In 1904 Binet was asked to devise a mental test suit- 
able for use in detecting slow learners in the schools of Paris. 
The test was to be used not only to sift out the subnormal chil- 
dren in the grades but also to provide a better understanding of 
degrees of feeblemindedness, with a view to improving the 
education of these children. In 1905, Binet, with a collaborator,
Théodore Simon, brought out the first scale for measuring intelligence.
This scale consisted of thirty problems and questions
arranged in order from easy to hard. A second edition of Binet's
scale appeared in 1908 [...]

These tests differed sh[...] [...]sight. He avoided questions which demanded
[...]; instead of asking the examinee [...] 6 x 3 or the name of the
largest city in France, Binet asked the child to repeat four digits
(single numbers) or the words of a sentence (heard only once);
to tell the “thing to do” in specific problem situations; to 
criticize (“see through”) an absurd statement or fallacy; to give 
differences between, for instance, a president and a king; to 
define abstract words like justice and loyalty. 

Binet’s famous tests became the basis for the widely used 
Stanford Revision of the Binet-Simon Scale, described in Chapter 
3. To Binet belongs the credit for having set up the first “age 
scale"—that is, a test series in which items are arranged or 
grouped by age levels. A child’s “score” on an age scale is deter- 
mined by the level attained and is expressed by a mental age 
(MA), which denotes the child’s maturity.

Children of preschool age are unable to do tests which require 
reading and word knowledge. For these children, therefore, as 
well as for children handicapped in speech, vision, or hearing, 
and for the non-English speaking, performance tests must be 
used. In a typical performance test, the child is asked to identify 
common objects, string beads, build towers of blocks; or he may 
be asked to fit blocks into cutouts, arrange pictures in sequence, 
match the colors of cubes. Performance tests have been devised 
for use with illiterate and less intelligent adults as well as with 
children. 


Intelligence Tests: Group


When intelligence tests are administered to large groups of 
examinees at the same time, they are appropriately called “group 
tests.” The first group tests were developed (in 1917) during 
World War I. Together with other information, these tests 
were used (1) in accepting or rejecting men, (2) in the classifica- 
tion of those accepted, (3) in the assignment of draftees to 
various types of service, and (4) in determining admission of 
candidates to officer training schools. There were two kinds of 
group test, called Army Alpha and Army Beta. The first was 
intended for soldiers who could read and write; it required that 
an examinee follow fairly involved directions, solve “mental 
arithmetic” problems, know the meanings of words, and perceive
relations (for example, in an analogies test the question might
be as follows: Hand is to foot as give is to ____). Army Beta
was a non-language or non-verbal test. It made use only of 
diagrams, pictures, and numbers and was answered by a simple 
system of marking. Army Beta was administered to the illiterate 
and the foreign-born. Directions were given in pantomime for 
the benefit of those soldiers who did not understand English. 

During World War II, a group intelligence test called the
Army General Classification Test (AGCT) was administered to 
some 12,000,000 men. AGCT is a verbal or language test. It 
includes three sorts of materials: verbal (vocabulary), numerical 
(arithmetic problems), and spatial (for example, problems in 
spatial relations presented by pictures of block piles to be 
“counted” by the examinee). No specific “school” questions
were asked since the test was designed to measure mental alertness
in dealing with symbolic materials apart from specific training.
Both Alpha and AGCT are still used in the testing of adults.

Between World Wars I and II, scores of group tests of intelligence
were constructed and used widely in the schools and colleges.
These and other mental tests (aptitude, personality) have
been widely employed in business and industry as an aid in the 
selection and placement of personnel. 

In most group intelligence examinations, items are answered
by marking one of several possible solutions (multiple-choice),
by selecting one of two answers (true-false), and by checking or
underlining the appropriate reply among several options. These
answer techniques are called “objective” (p. 185), because in
scoring such tests the judgment of the examiner does not enter
in, or does so to a very slight degree. Group tests of intelligence [...]


Educational Achievement Tests 


Since World War I, a number of tests of educational achievement
have been constructed on objective principles. These tests
are used to determine general educational level or standing, as
well as knowledge of a given subject field, for example,
geometry or French. The general survey test, when used in the 
elementary school, is a comprehensive examination of the stu- 
dent’s knowledge of reading, spelling, arithmetic, grammar and 
literature, history and elementary science. Tests in separate
subjects—history or physics, for example—are also available at 
educational levels from the secondary school to college. Educa- 
tional achievement tests are called diagnostic when they are used 
to reveal a student’s weaknesses in a particular area such as 
arithmetic or reading. Diagnostic tests must of necessity cover a 
wide range of information and skills in a given subject. Educa- 
tional achievement tests are described in Chapter 5. 


Aptitude Tests 


Tests designed to discover whether a student is “gifted” in 
music or mathematics, say, or whether a young man has the 
knack for dealing with tools and mechanical contrivances are 
called aptitude tests. Aptitude may be inferred (1) from the
degree of mastery attained in a “new” subject after a period of
study. Aptitude for a foreign language, for instance, is demonstrated
in the ease with which the subject (Spanish, for example)
is acquired after a term's work. Achievement tests, given after a
period of “exposure,” reveal this aptitude directly. Aptitude is
also inferred before a period of study by testing (2) to see
whether an examinee possesses those abilities and skills judged
to make for success in a given subject (for example, physics),
or in a profession (for example, medicine or law). Aptitude for
physics is gauged by finding how well the student has learned
the mathematics necessary for work in physics; aptitude for law
is judged by the student’s ability to read difficult prose, comprehend
fairly involved legal arguments and follow a line of
reasoning to a conclusion. What are called “differential aptitude 
tests" are designed to assess a student's strengths and weaknesses 
in certain fundamental abilities believed to be crucial in a 
number of activities—in and out of school. 

Tests of general mechanical aptitude sample performance in a
number of activities believed to demonstrate mechanical knowledge
and skill. Factors measured by these tests include familiarity
with tools, insight into mechanical relations (pulleys, levers, and
the like), ability to solve problems expressed in diagrams of
machines and mechanical contrivances, and interest in mechanical
things, as shown by the reading of popular science, building
radios, tinkering with cars and so on. Manipulative tasks and
mechanical gadgets have been employed to test for special
abilities in a variety of situations. Among the traits studied are
manual dexterity, sensory-motor skills, visual and auditory
acuity, all of which are needed in many jobs in industry and in
the armed forces.

Clerical aptitude tests cover the knowledge and skills needed
in a business office. Tests under this head provide scores from
which we can predict an examinee’s ability to carry out the
written work of an office: to spell, check records, read and
write easily and accurately.

Aptitude tests of a special sort have been devised for inferring
talent in art and music. In music, for instance, many of the
factors needed for success can be measured: “ear” for music,
rapid and accurate reading of music at sight, and knowledge of
harmony and other technical phases of music. In art, [...] color, form,
[...] determined by comparing a student’s judgments with those of
acknowledged experts. Knowing whether a person possesses
talent [...]

[...] have been used in [...] socio-economic,
home, and community [...]

“Tests” of personality are
in reality standard interviews designed to reveal characteristic
ways of behaving. The personal adjustment questionnaire or 
personal data sheet inquires into a person’s fears, worries, 
anxieties, and home and work adjustments. Such inventories are 
often appropriately called “trouble sheets.” In some cases, the 
questions are direct and undisguised: “Are you afraid of high 
places?” “Can you stand the sight of blood?” “Do your parents 
treat you right?” In other adjustment inventories, questions are
disguised and indirect, so that the intent of the question may not 
be understood by the examinee. A technique often used in such 
inventories is that of “forced choices” (p. 168). 

Attitude questionnaires attempt to reveal systematic ways of 
behaving or thinking about social, religious, or political matters. 
Can a student be classified as narrow- or broad-minded, religious 
or irreligious, or somewhere between these extremes? Attitude 
inventories try to answer these questions.

Interest inventories survey a person’s interests in books, sports,
people, occupations, social activities, and the like. An examinee’s 
pattern of interests may serve to identify him with some well- 
defined occupational group—for example, lawyers or chemists. 
Or a young man’s interests may identify him with some area of 
interests, such as science, business, or social service. Interest tests 
are especially valuable in counseling, since interest, as much as 
ability, may determine a student’s educational or vocational 
choices. 

Another group of personality tests makes use of what has been 
called “projective” techniques. Projective tests are disguised 
interviews in which an examinee is asked what he “sees” in some 
neutral situation—an ink blot or a picture, for example. These 
tests are perhaps most useful in the diagnosis of disturbed mental 
states. They must be administered by an expert and are employed 
mostly by psychiatrists and clinical psychologists in severe be- 
havior problems. 

The techniques of the personality questionnaire have been 
widely used in polls conducted to assess public opinion about
such things as political issues and social questions. Inventories
have been employed, too, to survey systematically the association
between items in a constellation of attitudes or opinions, for
instance, between social and economic background factors,
preferences for political candidates, etc., and views about social
and economic issues. In sociological studies in which environmental
factors loom large, the kind of home from which a child
comes, the educational and occupational status of the parents,
and the character of the community may be revealed by a
systematic survey of background variables. Personality tests are
treated in Chapter 7.


SUGGESTIONS FOR FURTHER READING 


Comprehensive accounts of the development of mental testing and of
the application of tests in various areas will be found in the references
[...]

[...]gical Testing. New York: Macmillan, 1954.

[...] of Psychological Testing. (Rev. [...]

[...]ment in Today’s Schools. (3rd [...]

[...] Measurement and Evaluation [...] Wiley, 1955.


CHAPTER 2


STATISTICS IN MENTAL TESTING 


The purpose of this chapter is to acquaint the prospective user 
of mental tests with those statistical terms and techniques most 
often used in testing. Stress throughout the chapter is on the 
meaning and significance of symbols and terms rather than on 
the mechanics of computation. For the latter, the student should 
consult the Appendix as well as the books on statistical method 
listed at the end of this chapter. 

Perhaps the best advice one can offer the teacher who is plan- 
ning to use mental tests is that he first take a course in statistics. 
For students who have been wise enough to do so, the present 
treatment will constitute simply a brief review and summary.
And for those who have had no statistical training, it will provide
the minimum essentials for the understanding and evaluation
of mental tests themselves.


THE FREQUENCY DISTRIBUTION 


Drawing Up a Frequency Distribution 


Suppose that a teacher has administered a test of English
grammar to fifty children in the seventh grade. The papers have
been marked and the names and scores of the children recorded.
Two questions ordinarily arise: (1) What is the typical performance
of the class, and (2) What is the range of talent in
the class? To answer these questions, we may organize and present
the fifty scores in one of several ways.

Table 2-1 is a systematic tabulation of the fifty English 
grammar scores into what is called a frequency distribution. 

The fifty scores have been arranged from high to low into 
sets of five under the heading “Scores.” In the frequency column 
headed “f” are listed the numbers of scores which fall into each 
sub-group. For example, five children score in the interval 60-64,
eight in the interval 55-59, and so on down to four who score
in the bottom interval, 30-34.


TABLE 2-1

Frequency Distribution of Fifty Scores
on a Test of English Grammar

Scores      f
60-64       5
55-59       8
50-54      10
45-49      12
40-44       6
35-39       5
30-34       4

N = 50


A test score is always taken to represent the distance along
some scale of ability running from low to high. Thus, a score
of 46 covers the span from 45.5 to 46.5, 46.0 itself being the 
middle of the score interval. Other scores have the same meaning: 
in each case the score covers the distance .5 unit below to .5 
unit above the face value of the given score. This definition of a 
score means, of course, that the interval 30-34 begins at 29.5 
and ends at 34.5, that interval 35-39 begins at 34.5 and ends at 
39.5, and so on. For convenience in writing, the intervals in 
Table 2-1 are the score limits rather than the exact limits. In each 
case, however, the exact limits of the intervals are understood. 
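
The tallying and the exact limits described above can be sketched in a few lines of Python. This is only an illustration: the fifty raw scores are not listed in the text, so `sample_scores` is a hypothetical set, and the function names `tally` and `exact_limits` are invented for the sketch.

```python
from collections import Counter

def tally(scores, lowest=30, width=5):
    """Count how many scores fall in each interval of the given width."""
    counts = Counter((s - lowest) // width for s in scores)
    table = []
    for k in sorted(counts, reverse=True):  # high intervals first, as in Table 2-1
        low = lowest + k * width
        table.append((low, low + width - 1, counts[k]))
    return table

def exact_limits(low, high):
    """Exact limits reach half a unit beyond the score limits."""
    return (low - 0.5, high + 0.5)

# Hypothetical scores, for illustration only (not the data of Table 2-1).
sample_scores = [31, 36, 38, 42, 46, 47, 47, 52, 58, 63]

for low, high, f in tally(sample_scores):
    print(f"{low}-{high}  f={f}  exact limits {exact_limits(low, high)}")
```

Intervals that contain no scores are simply omitted from the output of this sketch.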


Graphic Representation of the Frequency Distribution 


A frequency distribution may be represented graphically by
a frequency polygon, as shown in Figure 2-1. In the construction
of a frequency polygon, scores are laid off along the baseline, or
X-axis, at equal intervals, and the frequencies (f’s) are plotted
on the vertical or Y-axis. Each f is plotted directly above the
midpoint of the interval upon which it falls. The four scores
falling in the first grouping, 30-34, are plotted above 32, the
midpoint of the interval. In the other intervals (reading up), 5
scores are plotted above 37, midpoint of 35-39, 6 above 42,
12 above 47, and so on. The points are joined with short straight
lines to give the outline of the frequency polygon.


FIGURE 2-1 Frequency Polygon of Fifty Scores Achieved by
Seventh-Grade Children on a Test of English Grammar
(X-axis: Scores; Y-axis: Frequencies)

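
As a rough sketch of the plotting rule just described, the points of the polygon can be computed from Table 2-1 by pairing each interval's midpoint with its frequency. The names `table_2_1` and `polygon_points` are illustrative; joining the resulting points with straight lines in any plotting tool would give the outline of the figure.

```python
# Frequencies from Table 2-1, keyed by the score limits of each interval.
table_2_1 = {(30, 34): 4, (35, 39): 5, (40, 44): 6, (45, 49): 12,
             (50, 54): 10, (55, 59): 8, (60, 64): 5}

def polygon_points(table):
    """Return (midpoint, frequency) pairs, ordered from low scores to high."""
    return [((low + high) // 2, f) for (low, high), f in sorted(table.items())]

points = polygon_points(table_2_1)
print(points)
```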

A frequency polygon shows graphically how the scores are 
spread over the test scale from low to high. From Figure 2-1 it 
is apparent that more children scored in the middle of the scale 
(see, for example, the 12 on interval 45-49) than at either
extreme. Rules for constructing a frequency polygon so as to 
provide a good picture of the test data will be found in the 
Appendix. 

Another way of representing a frequency distribution graphically
is the histogram. Figure 2-2 represents the f’s on the score
intervals by small rectangles set up over each interval. For the
first interval, the rectangle is four Y-units high, and for the
second interval five Y-units high, and so on. The highest rectangle,
12 units on the Y-axis, is above interval 45-49.


FIGURE 2-2 Histogram of Fifty Scores Achieved by Seventh-
Grade Pupils on a Test of English Grammar


The histogram and frequency polygon represent the same
data, and there is little to choose between them. Frequency
polygons are to be preferred to histograms when two distributions
are plotted on the same axes, since in the histogram the
vertical and horizontal lines often coincide, making the figures
difficult to disentangle.
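
A crude text version of the histogram can be printed directly from Table 2-1. The sketch below turns the figure on its side, so that each rectangle becomes a horizontal bar whose length equals the frequency. The function name `text_histogram` is an invention for this illustration.

```python
# Intervals and frequencies as listed in Table 2-1 (high to low).
table_2_1 = [("60-64", 5), ("55-59", 8), ("50-54", 10), ("45-49", 12),
             ("40-44", 6), ("35-39", 5), ("30-34", 4)]

def text_histogram(table):
    """Return one line per interval, with a bar as long as its frequency."""
    return [f"{label}  {'#' * f}  ({f})" for label, f in table]

lines = text_histogram(table_2_1)
for line in lines:
    print(line)
```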


The Normal Curve


The symmetrical bell-shaped graph shown in Figure 2-3 is the 
well-known normal curve. This "ideal" frequency polygon is
the mathematical model to which many distributions of actual
scores approximate. (See, for example, Figure 2-1.) The normal
curve is often called the normal probability curve because it
shows the probability of occurrence of scores of different size,
when these are determined by a large number of independent
and randomly combined factors.


FIGURE 2-3 The Normal Curve
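
One way to see how independent, randomly combined factors produce a bell-shaped distribution is to count the outcomes of tossing ten coins. The coin model is an illustration chosen for this sketch and is not taken from the text. Each toss stands for one small independent factor, and the "score" is the number of heads.

```python
from math import comb

def head_count_frequencies(n_tosses):
    """Number of equally likely outcomes giving 0, 1, ..., n_tosses heads."""
    return [comb(n_tosses, k) for k in range(n_tosses + 1)]

freqs = head_count_frequencies(10)
total = sum(freqs)  # 2**10 equally likely outcomes in all

for k, f in enumerate(freqs):
    # Bars are scaled down (one '#' per 8 outcomes) to fit on a line.
    print(f"{k:2d} heads  {f:4d}  {'#' * (f // 8)}")
```

The frequencies rise to a single peak in the middle and fall away symmetrically on both sides, which is the general shape of the normal curve.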

The normal curve has played an important role in the development
of mental measurement. Among its uses in testing may be
mentioned the following:


1. Selecting the Items of a Test. When the distribution of test
scores for a class is badly off-center or “skewed,” as shown in
Figures 2-4 and 2-5, the test is not suitable for the group. In
Figure 2-4 the test is too easy (there are too many high scores);
and in Figure 2-5 the test is too hard (there is a disproportionate
number of low scores). When the test maker takes the normal [...]


FIGURE 2-4 Negatively Skewed Curve


FIGURE 2-5 Positively Skewed Curve


[...] number scoring at the high and low ends. Note that, according
to the criterion of normality, the frequency polygon in Figure 
2-1 shows the English grammar test to be generally satisfactory, 
though perhaps a bit too easy. 


2. Scaling the Obtained Scores from a Test. Raw or obtained 
scores from a test are usually expressed by an arbitrary number 
of points. Scores of this sort do not represent equal steps or equal 
units along some ability scale; and since there is no zero point,
a score of 40 is not twice as good as a score of 20. When point
scores are transformed into deviations from the average or mean,
and expressed in units of the standard deviation (page 36) of
the group, they are called sigma-scores. The unit of deviation
(the standard deviation) is usually represented by the Greek
letter σ (sigma). Sigma-scores may later be converted into
standard scores (page 38). Many educational achievement and 
aptitude tests publish norms (page 40) in terms of standard 
scores. These scores are comparable from test to test when dis- 
tributions are normal, or approximately so. 
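
The transformation into sigma-scores can be sketched as follows: each score X becomes (X - M) / σ, where M is the mean and σ the standard deviation of the group. The raw scores below are hypothetical, and the use of the population formula for σ (`statistics.pstdev`) is an assumption of this sketch, since the computing formula itself is given elsewhere in the book.

```python
from statistics import mean, pstdev

def sigma_scores(scores):
    """Express each score as a deviation from the mean in SD units."""
    m = mean(scores)
    sd = pstdev(scores)  # population SD; an assumption about the intended formula
    return [(x - m) / sd for x in scores]

# Hypothetical point scores, for illustration only.
raw = [20, 30, 40, 50, 60]
z = sigma_scores(raw)
print([round(v, 2) for v in z])
```

Whatever the original point scale, the sigma-scores of a group always have a mean of zero and a standard deviation of one, which is what makes them comparable from test to test.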

Point scores may be changed over directly into equal-unit 
scores in a normal distribution. Such "normalized" scores have
several advantages (page 40).


3. Determining the Stability of a Test Score. An obtained score 
on a test—for example, a group test IQ—can be expected to 
vary somewhat up or down when the test is administered a 
second time. The variation to be expected in a score, that is, its 
probable stability, can be predicted from tables of the normal
probability curve (page 23).


AVERAGES


After a frequency distribution has been tabulated, we are
ready to compute a typical measure or average. There are three
sorts of averages, also called measures of central tendency, in
common use.


The Mean (M)


Given a set of ten scores, 10, 9, 10, 12, 8, 6, 4, 7, 5, and 4, the
mean is simply 7.5, found by adding the scores (75) and dividing
this sum by their number (10). The M is popularly called the
average. When scores have been grouped into a frequency
distribution, as shown in Table 2-1 on page 14, a slightly different
method is employed in finding the M. (See Appendix.) But the
M is always essentially the sum of the scores divided by their
number.


The Median (Mdn) 


When scores are arranged in order of size, another sort of
average, the median (Mdn), is the point in the distribution found
by counting off one-half of the scores from either end of the
series. We usually start with the low end. For example, for the
five scores, 7, 8, 9, 10, and 12, the median or mid-score is 9: there
are two scores above and two below it. When the number of
scores is even (for example, 5, 7, 8, 9, 10, and 12) the median
is midway between the two middlemost scores, namely, at 8.5.
There is no mid-score. When scores are grouped into a frequency
distribution, as shown in Table 2-1 on page 14, the median is
still the 50 per cent point, the point found by counting 50 per
cent of the way into the distribution. For a method of computing
the median, see the Appendix.
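
The two averages described so far can be checked with a short sketch. It uses only the ungrouped scores quoted in the text and Python's standard `statistics` module; the variable names are illustrative, and the grouped-data methods of the Appendix are not attempted here.

```python
from statistics import mean, median

# The ten scores used in the text to illustrate the mean.
scores = [10, 9, 10, 12, 8, 6, 4, 7, 5, 4]
m = mean(scores)  # sum of the scores (75) divided by their number (10)

# The odd and even sets used in the text to illustrate the median.
odd_scores = [7, 8, 9, 10, 12]
even_scores = [5, 7, 8, 9, 10, 12]
mdn_odd = median(odd_scores)    # the middle score
mdn_even = median(even_scores)  # midway between the two middlemost scores

print(m, mdn_odd, mdn_even)
```

With an even number of scores, `median` averages the two middle values, which matches the "midway" rule given above.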


The Mode 


That score in a set of scores which occurs most frequently is
called the crude mode, or the modal score. The crude mode is a
third sort of average. In Table 2-1 the crude mode is taken at 47,
the midpoint of the interval which contains the largest frequency.
The mode can be computed more exactly, but usually we simply
take the most often recurring score as the crude mode without
further refinement. In most cases, the mode is a preliminary
measure of central tendency. For exploratory purposes it does
not need to be computed so precisely as the mean or median.


MEASURES OF VARIABILITY 


The Range 


It is sometimes more important to know the variability of a
set of scores than to know the mean or median. Suppose, for
example, that two sections of Grade 7 have the same mean but
differ markedly in spread of talent, as evidenced in the variability
of scores around the mean. Figure 2-6 shows two distributions
of this sort: the scores in A range from 40 to 60, whereas the
scores in B range from 20 to 80. The difference between the high
and low scores in the A distribution is 20 points; in the B
distribution, 60 points. The range is the most general index of
variability. Other more exact measures are the standard deviation
(written as SD or σ) and the quartile deviation (written as Q).


FIGURE 2-6 Two Distributions with the Same Mean but
Differing Markedly in Range (Variability)


The Standard Deviation (σ)


The mean of the set of five scores (12, 10, 8, 6, and 4) is 8.
If 8 is subtracted from each score, we have 12 − 8 = 4, 10 − 8 =
2, 8 − 8 = 0, 6 − 8 = −2, and 4 − 8 = −4. The size of a
deviation from M tells the extent to which the individual score
deviates from the common mean; and the sign of the deviation
indicates its direction from M. If each deviation is now squared,
we have 4² = 16, 2² = 4, (−2)² = 4, and (−4)² = 16. The
square of 0 is, of course, 0. The sum of these squared deviations
is 40, and σ, the standard deviation, is defined as

$$\sigma = \sqrt{\frac{\sum (\text{deviations})^2}{N}}$$

or, in our example, $\sigma = \sqrt{40/5} = \sqrt{8} = 2.83$.*


Squaring the separate deviations around the M eliminates the
minus signs and gives extra weight to extreme deviations. An SD
or σ is judged to be large or small (to reflect much or little
variation) in relation to other SD's computed for the same test. For
example, if 35 boys and 42 girls have the same M on a history test
but the boys' σ is 10 and the girls' σ is 6, we know that the boys'
scores spread more than the girls' up and down the scale, in
both directions from the mean.
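
The computation of σ for the five scores can be sketched step by step as follows. This is only an illustration of the definition above (squared deviations summed, divided by N, square root taken); the variable names are assumptions, not the book's notation.

```python
import math

scores = [12, 10, 8, 6, 4]
n = len(scores)

m = sum(scores) / n                         # the mean, 8
deviations = [x - m for x in scores]        # 4, 2, 0, -2, -4
sum_sq = sum(d * d for d in deviations)     # sum of squared deviations, 40
sigma = math.sqrt(sum_sq / n)               # square root of 40/5

print(round(sigma, 2))
```

Note that the divisor here is N, as in the text, so the result agrees with `statistics.pstdev` rather than with the N − 1 form computed by `statistics.stdev`.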

In a normal curve, σ provides valuable information concerning
the way in which the separate measures fall around the
common mean. In Figure 2-7, for example, +3σ is seen to include
virtually all the measures above the M, and −3σ all of the
measures below the M. The total area of the normal curve is taken
as N. From tables of the area of the normal curve, we know that
between M and +1σ are approximately 34 per cent of the measures
(actually 34.13 per cent); and between M and −1σ are also
34 per cent of the measures. The two "halves" of the curve are
equal. Hence we find about 68 per cent of the measures, roughly
two thirds, between −1σ and +1σ. Furthermore, from tables we
find that 14 per cent of the measures fall between 1σ and 2σ in
the normal curve and about 2 per cent between 2σ and 3σ. The
same proportions hold, of course, for the half of the curve to
the left of the M, since the M divides the area of the normal
curve into two equal parts.


* For calculation of σ from a frequency distribution, see Appendix.


FIGURE 2-7 Areas Under the Normal Curve


The relations of σ to the total area (N) in the normal curve
model hold approximately for distributions which resemble the
normal curve in form. An illustration will make clear how the
normal curve model is used in such cases. Suppose that on a
reading test administered to sixty children in the fifth grade,
the M = 62 and σ = 8; and suppose further that the frequency
polygon of these scores closely resembles the normal curve in
form. Taking the normal curve as our model, then, we can say
that approximately two thirds of the scores (that is, forty) fall
between 54 and 70 (62 ± 8). Moreover, about 14 per cent of
the scores, or about 8, will fall between 70 and 78 (between
1σ and 2σ), and about 2 per cent, or 1 or 2, will fall between
78 and 86 (that is, between 2σ and 3σ). In the lower half of
the distribution, 14 per cent, or 8 scores, will fall between 54
and 46, and 2 per cent between 46 and 38. These relationships
are shown in Figure 2-8. Note that M, the reference point,
is 62 and that σ is 8 units on the test scale.
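
The counts quoted for the reading test can be approximated without printed tables. The sketch below assumes the scores follow the normal model exactly and uses `statistics.NormalDist` from the Python standard library in place of a table of areas; the function name `expected_count` is illustrative only.

```python
from statistics import NormalDist

M, SIGMA, N = 62, 8, 60          # mean, SD, and number of pupils from the text
model = NormalDist(mu=M, sigma=SIGMA)

def expected_count(low, high):
    """Expected number of the N scores falling between low and high."""
    return N * (model.cdf(high) - model.cdf(low))

# Bands one sigma wide, from -3 sigma to +3 sigma (the middle band is two sigma wide).
bands = [(38, 46), (46, 54), (54, 70), (70, 78), (78, 86)]
counts = {band: expected_count(*band) for band in bands}

for band, count in counts.items():
    print(band, round(count, 1))
```

The exact areas give about 41 scores between 54 and 70, about 8 in each adjoining band, and between 1 and 2 in each outer band, in line with the rounded figures of two thirds, 14 per cent, and 2 per cent used above.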




FIGURE 2-8 Use of Normal Curve Model to Show Distribution 
of Sixty Scores on a Reading Test 


The Quartile Deviation (Q) 


Just as we compute the median by counting off 50 per cent
of the scores, so we can count off 25 per cent of the scores
from the low end of the distribution (that is, 25 per cent of
N) to locate Q1, the first quartile point. Similarly, we can count
off 75 per cent of the scores from the low end of the distribution
to locate Q3, the third quartile. The gap between Q3 and Q1
is called the interquartile range, or range of the middle 50 per
cent. Q, the semi-interquartile range, is computed thus:

$$Q = \frac{Q_3 - Q_1}{2}$$

Like σ, Q is a measure of variability but, unlike σ, it is found
by counting into the distribution, whereas σ is computed from
the squared deviations taken around the M. When the median
is the measure of central tendency, we generally use Q; when
M is the measure of central tendency, we use σ.

Methods of computing Q from a frequency distribution will
be found in the Appendix; here we are concerned primarily
with the meaning of Q as a measure of variability. Q's usefulness
will become clearer when we have computed the percentile
curve, or ogive, as shown in the next section.



PERCENTILES AND PERCENTILE RANK 


Table 2-2 shows the frequency distribution of Table 2-1, 
with the addition of two columns in which the f’s have been 
cumulated. 


TABLE 2-2 


Frequency Distribution and Cumulated Frequencies of 
Fifty Scores on an English Grammar Test 
Data are the fifty scores in Table 2-1. 


(1)       (2)    (3)       (4)
Scores    f      cum. f    % cum. f


In column (3), frequencies have been added progressively
(cumulated) from the bottom to the top of the distribution. On the
first interval, 4 is the entry; 4 + 5 on the next interval gives 9;
9 + 6 on the third interval gives 15; and so on. In column (4),
these cumulated frequencies are expressed as percentages of N. In
Figure 2-9, the cumulated f's, in percentages, have been plotted
against the score-intervals laid off along the baseline. As scores
are added over each interval, each % cum. f is plotted just above
the upper limit of the interval upon which it falls. The resulting
S-shaped curve is called an ogive, or cumulative frequency
graph. The ogive constricts or expands the scale of scores into
a scale of one hundred points, called a percentile scale. The
median and the Q's can be read from the ogive almost as
accurately as they can be computed from a frequency distribution.
To illustrate, if a line is run from the 50 per cent point on the
Y-scale across to the curve, a perpendicular dropped from this
point to the score-scale locates Mdn at 49 approximately. (The
computed value is 48.66.) The twenty-fifth percentile, or Q1,
is located from the ogive at 42 approximately; and the
seventy-fifth percentile, or Q3, at about 55. Other percentile points
can be located in the same manner by going
from the appropriate point on the vertical percentage scale
across to the ogive and dropping a perpendicular to the baseline.
Note that the distance from Q3 to Q1 (that is, 55 − 42) is
the interquartile range or range of the "middle 50." One-half
of this distance is 13/2, or 6.5, which is the quartile deviation,
or Q. The larger the Q of a distribution, the greater the spread
of the middle 50 per cent of scores along the scale and the larger
the variability.


FIGURE 2-9 Ogive or Cumulative Frequency Curve

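
The arithmetic for Q can be written out as a small sketch. The quartile values are the approximate readings from the ogive quoted in the text (Q1 near 42, Q3 near 55); the function name is an assumption made for illustration.

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range: half the distance from Q1 to Q3."""
    return (q3 - q1) / 2

q1, q3 = 42, 55                      # approximate readings from the ogive
interquartile_range = q3 - q1        # range of the "middle 50"
q = quartile_deviation(q1, q3)

print(interquartile_range, q)
```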

A pupil's percentile rank (PR) is the position on the
percentile scale (on a scale of one hundred points) to which his
score entitles him. Suppose that Tom Brown achieves a score
of 40 on our English grammar test. What is his PR? Going out
to 40 on the score-scale on the baseline, up to the curve, and then
across to the Y-scale, we locate Tom's PR at about 20. This
PR tells us at once that about 20 per cent of the pupils scored
lower than Tom. If Mary Green scores 58 on the grammar test,
her PR is read at approximately 84, and 84 per cent of the
class made lower scores than she did. Scores achieved on tests
expressed in different units (for example, a reading test and an
arithmetic test) cannot be compared directly. But relative
positions (PR's) of a child in his classes can be quickly determined
and compared when both sets of scores have been converted
into a common percentile scale. Moreover, several PR's may be
combined to give a general index.
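
When the individual scores are at hand, a percentile rank can also be counted directly instead of being read from the ogive. The sketch below is an assumption-laden illustration: the ten class scores are hypothetical (the fifty scores of Table 2-1 are not reproduced here), and the PR is taken simply as the percentage of scores falling below the given score.

```python
def percentile_rank(score, scores):
    """Percentage of the scores in the group that fall below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical class of ten pupils, for illustration only.
class_scores = [31, 35, 38, 40, 42, 45, 47, 49, 52, 58]

pr_40 = percentile_rank(40, class_scores)   # three scores lie below 40
pr_58 = percentile_rank(58, class_scores)   # nine scores lie below 58

print(pr_40, pr_58)
```

Other conventions (for example, counting half of the tied scores) give slightly different values; the simple count is used here because it matches the reading "per cent of the pupils scored lower."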


CORRELATION* 


The relationship between two sets of test scores can be
described mathematically by the coefficient of correlation between
them. Correlation is expressed by a decimal fraction (called r),
which may vary along a scale from .00 to ±1.00. Let us suppose
that tests in English grammar and in history have been
administered to the same seventh-grade class. Suppose further that
children who score high in the English test tend to score high
in history, and that children scoring fairly high or quite low
in English tend to score fairly high or quite low in history. When
this happens the coefficient of correlation between the two
sets of scores will be marked or substantial, for example, .60
to .70. Now suppose that most pupils who score high in English
grammar score only average in arithmetic. The correlation
between these two areas would then be lower, perhaps no more
than from .20 to .30. If those pupils who score high in English
grammar tend to earn very low scores on a test in shop work,
the correlation here would be close to zero, or perhaps negative.


* See Appendix for computation of a correlation coefficient.


Positive coefficients of correlation run from .00 to +1.00;
good scores in the one test go with good scores in the other.
Negative coefficients of correlation run from .00 to −1.00 and
denote inverse relationship: good scores in the first test go
with poor scores in the second. Zero correlation denotes just
no correlation between two variables.
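
The behavior of r described above can be illustrated with a minimal sketch of the product-moment (Pearson) coefficient. The three sets of scores are hypothetical and chosen only to show a high positive and a perfectly inverse relationship; the Appendix method for grouped data is not reproduced.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for five pupils, for illustration only.
english = [40, 45, 50, 55, 60]
history = [42, 44, 53, 51, 60]   # tends to rise with English
shop = [60, 55, 50, 45, 40]      # falls exactly as English rises

r_history = pearson_r(english, history)
r_shop = pearson_r(english, shop)

print(round(r_history, 2), round(r_shop, 2))
```

Here the English and history scores move together and r is high and positive, while the shop scores run in the opposite order and r reaches −1.00.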


Whether a correlation coefficient is to be regarded as high or
low depends [...] the correlation of height with [...] usually
high, around .70 for a gi[...] a good intelligence test and
sc[...]ement are usually low and often negative. The following
table will aid in interpreting coefficients of correlation:


r's from   .00 to ± .20   very low; negligible
r's from ± .20 to ± .40   low; present but slight
r's from ± .40 to ± .70   substantial or marked
r's from ± .70 to ±1.00   high to very high

When computing the correlation between two forms of the
same test (the self-correlation of the test), we demand much
higher r's than are found typically between different variables.


The Reliability of a Test


If a child achieves a score of 48, for example, on a highly reliable
test of general science, subsequent scores earned by this pupil
upon equivalent forms of this test should not differ greatly from
the initial score of 48. But if the test is unreliable, in repeated
testing the score may vary widely from its first determination.

The reliability coefficient of a test is found by computing the
self-correlation of the test. Suppose that a reading examination
has been given to five sixth-grade classes and that two weeks
later the same test or an equivalent form is administered to the
same classes. If the correlation between these two administrations
of the test is high (a reliability coefficient of .90 or more
is considered high), we may feel confident that scores earned
by pupils in these classes are reasonably accurate measures of
"true" ability.

Test reliability is sometimes determined by repeating a test
and correlating the second set of scores against the first set. This
method is followed when there is only one form of the test.
More often, an equivalent or parallel form of the test is given,
and the reliability coefficient is the r between the test and its
alternate form. The reliability coefficients of many standard
educational tests have been determined in this way. Other ways of
determining test reliability will be found in the references at the
end of the chapter. The authors of standard tests will usually
specify what method has been used in computing the reliability
of their tests.


The Standard Error of a Score 


The accuracy or precision of an individual score is perhaps
best expressed by the standard error of a score, which is also
called the standard error of measurement. The SE (standard
error) is calculated from the following formula:

$$SE_{(score)} = \sigma \sqrt{1 - r_{11}}$$

where σ is the standard deviation of the test scores and $r_{11}$ is
the reliability coefficient of the test. Suppose that the σ of a
set of test scores is 10 and the reliability coefficient ($r_{11}$) is .95.
Then the SE of a score on this test is $SE = 10\sqrt{1 - .95}$, or 2.2.
This may be interpreted to mean that should a child take this
test a second time, the chances are good (about 7 in 10) that
his "new" score will not diverge by more than ± 2 points from
the true determination. The SE of a Stanford-Binet IQ is 4.5
points for IQ's from 90 to 110. In other words, if the test is
repeated, we can expect a child's IQ to stay within 4 to 5 points
of its true value.
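
The standard error of a score is simple to compute once σ and the reliability coefficient are known. The sketch below applies the formula above to the figures in the text (σ = 10, reliability .95); the function and argument names are illustrative assumptions.

```python
import math

def standard_error_of_score(sd, reliability):
    """SE of a score: SD of the test times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

se = standard_error_of_score(sd=10, reliability=0.95)

print(round(se, 1))
```

Because σ enters the formula directly, two tests with the same reliability coefficient can still differ in SE if their groups differ in variability.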

Reliability coefficients of standard intelligence and educa- 
tional achievement tests are generally above .90 for large groups 
of pupils. The size of the reliability coefficient depends upon 
several factors: the variability of the group, the length of the 
test, the method used in determining reliability. A reliability 
coefficient of .50 in a single grade or class may indicate as much 
stability of score as a reliability coefficient of .90 in a large 
group. The great advantage of the SE of a score is that it takes
account of both the reliability coefficient and the variability (SD)
in the group. (See page 56.)


The Validity of a Test 


A mental test is a valid testing device when it measures what 
it claims to measure. Tests are not valid for all areas and all 
situations, but are valid in certain defined situations and for 
certain behaviors. A group intelligence test, for example, is not 
a valid measure of emotional control or of delinquent behavior. 
Validity may be classified, for convenience, into three sorts:
experimental, content, and predictive. The validity of an
intelligence test is determined experimentally by computing the test's
correlation with various criteria: school grades, ratings for mental
alertness, and other measures of intellect, to mention a few.
Many of the best tests of general intelligence have been
validated against the Stanford-Binet, the best known individual
intelligence test (page 47). Aptitude tests (for example, those
of clerical and mechanical aptitudes) are validated against
demonstrated proficiency in office work or in mechanical tasks.
Independent measures against which tests are validated are called
criteria. Criteria do not represent entirely adequate or
completely sufficient determinations of a trait. Insofar as criteria
incorporate valuable aspects of the behavior we are studying,
however, they represent variables with which a test, to be valid,
must correlate positively.

Tests of educational achievement in history, mathematics,
languages, and the like possess content validity in that test
questions sample the subject matter areas directly. Content
validity is not alone a sufficient index of a test's usefulness. Such
considerations as choice of items, extent of sampling, form in
which items are put, and level of difficulty are also very
important. But content validity is a necessary first step. Intelligence
and aptitude tests possess content validity insofar as the items
in them fulfill the author's definition of what he is measuring.
Such asserted or "face" validity, however, is never as convincing
as is the content validity of the educational achievement test.
Generally, tests of intelligence and of aptitude must depend for
their validity upon correlations with independent criteria judged
to be dependable indices of the trait under study.

Predictive validity is the degree to which a test battery is related 
to some criterion of future performance or measure of success 
which will become available in the future. The predictive validity 
of a good group intelligence test for school performance ranges 
from about .40 to .60. (See page 96.) Many short tests have low 
correlations with a criterion, but when put with other tests 
into a team combine forces to raise the correlation of the battery 
with the criterion. Validity coefficients do not run as high as do 
reliability coefficients, since no test can correlate higher with 
other tests than with measures of itself. 

Personality questionnaires, interest blanks, and attitude scales
have content validity insofar as choice of items is concerned.
Such instruments are usually validated experimentally against
objective expressions of interest, indices of neurotic behavior,
and the like.


Practical Considerations in the Choice of a Test 


There are a number of factors which enter into the choice
of a mental test besides validity and reliability. Some of the
more important are the following:


1. Appearance: Is the test format good? Are the items
attractively presented and arranged?

2. Administration: How much time is required to give and
score the test? What is the cost?

3. Manual: Does the author give full accounts of reliability and
validity (how found, upon what samples, of what sorts)?
Are instructions clear?

4. Norms: Are the test norms readily interpreted? Are age and
grade equivalents given? What type of scaling is used?

SCALING TEST SCORES 


[...]g is (1) to revamp the raw test scores [...] units, and (2) to
enable us to combine [...] index. It is sometimes important
(especially [...] compare relative performances, and this [...]
raw scores, [...]


The Age Scale

[...] represents the performance of the average child who is 9 years
and 6 months old. Thus if Dick achieves an MA of 9-6 on the
Stanford-Binet, this mental age is a measure of his intellectual
status or degree of mental growth.

If Dick's life age (CA) is 10-4, his IQ is 92. IQ = MA/CA,
and in our example is 114 mos./124 mos. = .92; the decimal point
is dropped. IQ is a measure of a child's brightness relative to
that of other children of his age. When MA and CA are equal, the
IQ is 100, the brightness index of the average child. Dick's IQ of
92 means that he is somewhat less bright than the typical child of
his age level. IQ's above 100 are achieved by bright children, those
whose mental growth runs ahead of their years. IQ's below 100
indicate that a child is below normal, and very low IQ's (70 or
below) imply feeblemindedness.

The age-scale is used in most individual intelligence scales and
by many group tests of general intelligence. The MA and IQ
were first widely used to measure performance on the Stanford-
Binet test, which was constructed so as to meet the requirements
necessary to yield a constant ratio score, or IQ. Many group
tests do not meet these requirements. It is wise, therefore, to
accept IQ's from group tests as tentative indices of brightness
not always closely related to IQ from the Stanford-Binet.
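
The ratio IQ for Dick can be worked through as a brief sketch. Ages written as years-months (9-6, 10-4) are first converted to months; rounding to the nearest whole number is an assumption made here to reproduce the reported IQ of 92, and the function names are illustrative.

```python
def to_months(years, months):
    """Convert an age written as years-months (for example 9-6) to months."""
    return 12 * years + months

def ratio_iq(ma_months, ca_months):
    """Ratio IQ: mental age over life age, times 100, to the nearest whole number."""
    return round(100 * ma_months / ca_months)

ma = to_months(9, 6)     # MA of 9-6
ca = to_months(10, 4)    # CA of 10-4
iq = ratio_iq(ma, ca)

print(ma, ca, iq)
```

When MA and CA are equal the function returns 100, the index of the average child.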


The Percentile Scale 


We have already seen (page 25) how obtained scores can be
fitted into a scale of one hundred units to yield a percentile
scale. The PR (percentile rank) of a score, its position on the
percentile scale, can be computed from the frequency
distribution of scores. But the simplest plan is to plot an ogive (see
Figure 2-9, page 26) and read the PR from the graph. The PR of
any score then becomes the percentage of the distribution which
lies below the score. This method is not accurate beyond the
first decimal, but it is sufficiently precise for many purposes. It
is easy to apply and requires a minimum of calculation. Table 2-3
gives the frequency distribution of 180 scores on a clerical
aptitude test earned by students enrolled in several courses in a
business college.


TABLE 2-3 


Frequency Distribution of 180 Scores Achieved on a 
Clerical Aptitude Test 


Scores       Midpoints   cum. f   % cum. f   PR's of midpoints
194 - 196    195
191 - 193    192
188 - 190    189
185 - 187    186
182 - 184    183
179 - 181    180
176 - 178    177
173 - 175    174
170 - 172    171


The ogive in Figure 2-10 has been constructed from the
% cum. f's in Table 2-3 following the method outlined on page
26. In the last column are entered the PR's of the midpoints of the
successive score-intervals. The midpoints reading down are 195,
192, 189, 186, 183, 180, 177, 174, and 171. Any student who
earns a score of 191, 192, or 193 (who falls, that is, in the interval
next to the top) receives a PR of 95, the PR of the midpoint of
this interval. These midpoint PR's constitute norms for the test.
PR's can be read with considerable accuracy from the ogive.


FIGURE 2-10 Cumulative Frequency Polygon (Ogive) of 180
Scores Achieved on a Clerical Aptitude Test


PR's have several advantages over raw scores. Suppose that
a child has taken tests in arithmetic, science, English, history, and
spelling. If his PR's in each of these tests are known, they can
be represented comparatively on a profile as shown in Figure
2-11. The graph permits a comparison of the child's achievement in
the five subjects. It is clear that he is satisfactory in arithmetic
and science (PR = 55), above average in history [...],
average in English (PR = 50), and below average
in spelling (PR = 45). Comparisons of this sort cannot be
made from raw scores. One disadvantage of the PR scale is the
fact that units are not equal at the extremes of the scale. When
PR's are below 20 or above 80, they must be compared (or
combined) with caution (see refs.).


FIGURE 2-11 Profile of the Percentile Ranks in Various Subjects
for a Given Child


Sigma-scores and Standard Scores


We have seen that one way of converting raw scores into
a scale is by means of percentile ranks. Another method of
scaling is to express the deviation of each test score from the
common mean in units of SD, thus putting all scores into σ-units.
Such "deviation" scores are called σ-scores and sometimes
z-scores. The following is an illustration of the method of
converting obtained scores into σ-scores.

Table 2-4 gives the M's and σ's earned by fifty sixth grade
pupils on five objective educational achievement tests. At the
bottom of the table are listed the scores achieved by two children,
Mary and Howard.


TABLE 2-4


M's and σ's Earned on Five Objective Tests of Educational
Achievement Given in the Sixth Grade


                   (1) Arith.   (2) Arith.   (3) Reading   (4) Grammar   (5) Science
                   Reas.        Comp.
Mean               62           124          43            28            46
σ                  10           20           7             4             8
Mary's scores      57           119          50            31            36
Howard's scores    62           144          41            26            49


From an inspection of these scores, it is clear that Mary is
below the class in arithmetical reasoning, arithmetical
computation, and science [...]. On the other hand, [...] arithmetical
reasoning [...] and slightly below [...] comparisons are useful,
but because of differences in the units in which
test scores are expressed, we cannot (1) compare Mary's and
Howard's scores in the several tests, except to point out that
they are above or below the mean, nor (2) combine either pupil's
scores into a single meaningful index of academic achievement.
Conversion of test scores into σ-units will permit us to carry
out both these operations.

The formula for a σ-score or z-score is

$$z = \frac{X - M}{\sigma} \quad \text{or} \quad z = \frac{x}{\sigma}$$

where (X − M) = x. Mary earned a score of 57 in arithmetical
reasoning. This score deviates −5 points from the mean
(57 − 62 = −5). If we divide this deviation of −5 by 10 (the
σ), we have −.50 as Mary's σ-score in arithmetical reasoning.
In Test 2, arithmetical computation, Mary's σ-score is (119 −
124)/20 or −.25. Her other σ-scores are computed in the same
way: those that are plus are above the mean, those minus below
the mean. Mary's five σ-scores are shown below:


Test:               (1)     (2)     (3)     (4)     (5)
Mary's σ-scores     −.50    −.25    1.00    .75     −1.25


Howard's σ-scores are found as were Mary's. In Test 1, his
score of 62 is exactly on the mean, and his σ-score is .00. In
Test 2, arithmetical computation, Howard's σ-score is (144 −
124)/20 or 1.00. Howard's scores are below the mean in Tests
3 and 4, and his σ-scores are minus. His scores are tabulated
below:

Test:               (1)     (2)     (3)     (4)     (5)
Howard's σ-scores   .00     1.00    −.29    −.50    .38
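
The σ-scores tabulated above can be reproduced with a short sketch. The means and σ's are those of Table 2-4; the reading mean of 43 is taken from the worked substitution later in the text (50 − 43), and Howard's science score is left out because the raw score is not legible in the table. Function and variable names are illustrative.

```python
def sigma_scores(raw, means, sds):
    """Deviation of each raw score from its test mean, in SD units."""
    return [(x - m) / s for x, m, s in zip(raw, means, sds)]

means = [62, 124, 43, 28, 46]   # Tests 1-5 (reading mean assumed to be 43)
sds = [10, 20, 7, 4, 8]

mary = [57, 119, 50, 31, 36]
howard_first_four = [62, 144, 41, 26]   # Tests 1-4 only

mary_z = sigma_scores(mary, means, sds)
howard_z = sigma_scores(howard_first_four, means[:4], sds[:4])

print(mary_z)
print([round(z, 2) for z in howard_z])
```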


It is apparent that σ-scores are simply plus or minus deviations
from the test mean expressed in σ units. A practical disadvantage
of such scores is the fact that they are small decimal fractions
and are about as often + as −. For greater convenience,
therefore, σ-scores are usually converted into a distribution of
standard scores with an assigned M and σ. M's and σ's often
selected are M = 100, σ = 20; M = 500, σ = 100; and M = 10,
σ = 3.

If Mary's and Howard's scores are converted into a standard
score distribution with M = 100 and σ = 20, we have the
following:


Tests:                      (1)    (2)    (3)    (4)    (5)    Total   Mean
Mary's standard scores      90     95     120    115    75     495     99
Howard's standard scores    100    120    94     90     108    512     102


In the first test, Mary's σ-score is −.50, or .50 of σ below the
M. In our new distribution (M = 100, σ = 20), the equivalent
standard score is one-half of σ below 100, or 90. In Test 2,
Mary's standard score is 1/4 of σ below the mean of 100, or 95
(1/4 of 20 is 5). A formula for converting obtained scores directly
into standard scores with an M = 100 and a σ = 20 is the
following:

$$X' = \frac{20}{\sigma}(X - M) + 100$$

in which X' = standard score in the "new" distribution
         X  = original or raw score
         M  = mean of the raw score distribution
         σ  = SD of the original or raw scores
         100 and 20 are the M and σ of the new distribution


Substituting for Mary's raw score of 50 in Test 3, we have

X' = 20/7 (50 − 43) + 100
   = 120

In Test 4, for example, Howard's score is 26, and from the formula

X' = 20/4 (26 − 28) + 100
   = −10 + 100, or 90

The formula will convert any pupil's raw scores into standard
scores when the M of the standard score distribution is 100 and
σ is 20.

When put in standard score form, Mary’s and Howard’s scores 
can be compared directly; and the five scores of each child can be 
combined with equal weights. On the five tests, Mary’s average 
is 99 and Howard’s 102. 
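
The conversion to standard scores and the averaging of a pupil's profile can be sketched as follows. The raw-score means and σ's are those of Table 2-4 (with the reading mean taken as 43, the value used in the worked substitution), the new scale has M = 100 and σ = 20 as in the text, and the function name is an assumption.

```python
def standard_score(x, mean, sd, new_mean=100, new_sd=20):
    """Place a raw score on a scale with the given new mean and SD."""
    return new_sd * (x - mean) / sd + new_mean

means = [62, 124, 43, 28, 46]   # Tests 1-5 (reading mean assumed to be 43)
sds = [10, 20, 7, 4, 8]
mary = [57, 119, 50, 31, 36]

mary_std = [round(standard_score(x, m, s)) for x, m, s in zip(mary, means, sds)]
mary_total = sum(mary_std)
mary_average = mary_total / len(mary_std)

print(mary_std, mary_total, mary_average)
```

Because every test is now expressed on the same scale, the five scores enter the average with equal weight.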


The IQ as a Standard Score 


When raw scores are converted into standard scores in a
distribution with a mean of 100 and a σ of 15, these new scores
are often called "deviation IQ's" (page 36). In the Wechsler-
Bellevue Intelligence Test, for example, IQ's are determined by
this method. A Wechsler-Bellevue IQ of 115 is 1σ above the
mean of the group; an IQ of 85 is 1σ below the mean of
the group.

A general formula for transforming obtained scores into
standard scores with any given mean and σ is

$$X' = \frac{\sigma'}{\sigma}(X - M) + M'$$

where

X' = standard score in new distribution

X  = obtained score (usually in points)

σ' = SD of the new distribution

σ  = SD of obtained score distribution

M' = M of standard score distribution

M  = M of raw score distribution


This formula may be used to compute deviation IQ's. Suppose
that Arthur J., a veteran 32 years old, earns a score of 86 on an
intelligence test for which the mean of his age-group is 80 and
the σ is 10. What is Arthur J.'s deviation IQ? Substituting in the
formula, we have

X' = 15/10 (86 − 80) + 100
   = 109

The formula is useful when we wish to convert the sub-tests of
a battery into comparable units which may be combined into a
single score.
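
The general conversion formula lends itself to a one-line function. The sketch below applies it to Arthur J.'s score; the argument names mirror the symbols of the formula but are otherwise illustrative.

```python
def convert_score(x, mean, sd, new_mean, new_sd):
    """General conversion: X' = (new_sd / sd) * (X - mean) + new_mean."""
    return new_sd * (x - mean) / sd + new_mean

# Arthur J.: raw score 86; age-group mean 80 and SD 10; IQ scale mean 100, SD 15.
arthur_iq = convert_score(86, mean=80, sd=10, new_mean=100, new_sd=15)

# A raw score exactly one SD above the age-group mean lands at 115 on the IQ scale.
one_sd_above = convert_score(90, mean=80, sd=10, new_mean=100, new_sd=15)

print(arthur_iq, one_sd_above)
```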


Normalized Standard Scores, or T-Scores 


When raw point scores are transformed into PR's and the
resulting PR's are converted into equivalent "scores" in a normal
distribution, the final scores are said to be "normalized." If the
normal scaling distribution into which the scores are converted
has an M = 50 and σ = 10, the normalized scores are called
T-scores. Converting raw scores into T-scores can be easily
done with the aid of tables prepared for this purpose. First, the
PR's of the scores (or of the midpoints of the successive
intervals) are read from an ogive. The T-scores (normalized scores)
corresponding to these PR's are then read from tables. T-scores
range theoretically from 0 to 100, practically from about 15 to
85. The method of computing T-scores for a given distribution
will be found in detail in the references.
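
The table lookup from PR to T-score can be approximated with the inverse of the normal curve. The sketch below is an illustration, not the published tables: it assumes the customary T-scale with a mean of 50 and a σ of 10 (consistent with the stated range of 0 to 100), and it uses `statistics.NormalDist.inv_cdf` from the Python standard library.

```python
from statistics import NormalDist

def t_score(percentile_rank):
    """Normalized score (M = 50, SD = 10) for a PR strictly between 0 and 100."""
    z = NormalDist().inv_cdf(percentile_rank / 100)   # sigma-score in a normal curve
    return 50 + 10 * z

for pr in (16, 50, 84):
    print(pr, round(t_score(pr)))
```

A PR of 50 corresponds to a T-score of 50, and PR's of about 16 and 84 fall roughly one σ below and above the mean, at T-scores near 40 and 60.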

For several reasons, T-scaling is theoretically the soundest 
method of converting raw scores into an equal-unit scale. Many 
of the widely used educational achievement tests make use of 
some variety of T-scaling. T-scores can be added or averaged;
they have the same meaning and denote the same relative
achievement.


NORMS 


Norms are scores which are typical or characteristic of pupils 
of a given age or grade. To provide comparable norms, scores 
on group tests are expressed in PR's, standard scores, or nor- 
malized scores. Performance tests and individual intelligence 
scales have norms expressed in MA's and IQ’s. Many group
intelligence tests also have their raw scores put into MA and 
IQ terms. Such MA's and IQ’s are rarely comparable to the 
MA’s and IQ's of the Stanford-Binet. 


Educational achievement tests usually provide both age and
grade norms. From a table of norms, a teacher can tell whether
her class is up to grade level, and she can tell how individual
pupils in her class stand relative to each other on the sub-tests
of the battery. Suppose that Carl W., age 11-2, and just entering
the sixth grade, earns a score of 18 on an arithmetical problems
test of the Metropolitan Achievement Test. From the table of
norms we find that Carl has a PR of 68 on the test. Furthermore,
we find that his age-equivalent is 12-4 and his grade equivalent
is 6.9. Carl's score is typical of children about a year older
than he, and his knowledge of arithmetic equals that of children
who are completing the sixth grade. His PR, of course, reflects
performance above the average.

The SRA (Science Research Associates) verbal and non-verbal
tests are group tests of intelligence. Norms are given in
PR's and IQ's. If a child achieves a score of 34 on the verbal 
section, for example, his PR from the table of norms is 40 and 
his IQ (really a standard score) is given as 96. The Stanford 
Achievement Test provides age and grade equivalents to obtained 
scores. Raw scores from nine sub-tests are converted into an 
equal-unit scale, in accordance with which a profile is drawn up 
(page 35). Suppose that Louise M., age 12 years and 6 months 
and in the last quarter of the seventh grade, earns a raw score of 
40 on the science test of the battery. From tables of norms we 
find that this score has a grade equivalent of 8.3 and an age 
equivalent of 13-4. Thus Louise's score in science places her 
above her age and grade levels. Her PR on the science test is 60. 

Many aptitude tests supply scaled score norms for various 
groups of workers differing in experience, training, and skill. 
Interest inventories are scored so as to reflect an applicant's in- 
terests in a large number of occupations. Thus if the vocational- 
interest blank is scored with the key for lawyer interests, we can 
tell whether the applicant has the interests of a lawyer and to 
what extent. Scores from personality inventories serve to identify
a subject as “dominant,” “introverted,” or “neurotic” in
relation to the norms given for these traits.

Teachers use test norms for a number of purposes, which will
be elaborated upon in later chapters. Among the more important 
objectives, we may list the following. 


1. To estimate group achievement. Performance of the class as 
a whole can be evaluated against national, state, or local 
norms (page 115). 

2. To evaluate individual achievement. A pupil's score on an 
educational achievement test is always considered in connection
with his native capacity or mental alertness. A slow or
dull child may be working up to his limit, whereas a bright
child may be performing below expectation.

3. To evaluate family and cultural background. The achieve- 
ment of a class or of an individual will always depend on his 
socio-economic status, family background, and opportunities.

4. To evaluate curriculum effects. A pupil’s achievement
must be judged as good or poor in the light of the content, 
emphasis, and objectives of the school. 

5. To measure individual differences. There are always wide 
differences in academic achievement within a group or class. 
These differences are due in part to differences in native 
ability and in part to differences in environmental opportunity.


QUESTIONS FOR DISCUSSION

1. A fifty-item multiple-choice test in science, administered to ninety
pupils, showed scores ranging from 16 to 48. Fifty scores fell between
35 and 48. What would the distribution be like? Would it be skewed?
What measure of central tendency would be most suitable?

2. In question 1, what would you conclude about the suitability of the 
test for the group? 

3. Explain the implications of each of the following correlation 
coefficients: 

(a) The correlation between height and score on an arithmetic test
is .04.

(b) Ratings of pupils for social adjustment and aggressiveness show a
correlation of −.65.

(c) The correlation between term grades and scores on a group
intelligence test is .70.

4. Rank the following 23 scores in order of size: 35, 40, 31, 29, 35, 23, 
32, 34, 28, 34, 15, 14, 34, 40, 22, 32, 30, 39, 50, 19, 40, 27, 37. Compare the 
“mid-score” with the mean. 

5. Karl's PR on a biology test is 48. What does this mean? 

6. Margaret has taken five tests. What would be the advantage of 
expressing her scores on these tests as PR's? 

7. Given the following:

                Paragraph Reading     Arithmetic
    Mean              51.7               38.5
    σ                  9.2                6.5

William achieves a score of 56 on the first test and 35 on the second.
Convert these raw scores into z-scores.

8. How are age and grade norms obtained? Which is the more useful
in determining placement?

9. Two classes earn about the same mean on a test, but Class A’s SD is
twice the size of Class B’s. What do you conclude from this fact?

10. How would you validate a teacher-made test?


CHAPTER 3


INDIVIDUAL INTELLIGENCE SCALES 


This chapter will consider four individual intelligence scales or
test batteries.* These are (1) the Stanford-Binet** (1937 or revised
form), designed for children from age 2 through adolescence;
(2) the Wechsler-Bellevue Intelligence Scale, for use primarily
with adults; (3) the Wechsler Intelligence Scale for Children
(WISC); and (4) the Arthur Performance Scale, useful from
about age 4 to maturity. These four scales (one for adults and
three for children) are representative of the best individual
intelligence scales now available. They are carefully constructed,
widely used, valid, and dependable. Ordinarily, the individual
intelligence test will not be administered by the classroom
teacher. But the teacher must be familiar with the make-up of
these scales and with their role in the school program if he is to
make good use of the test findings.

* A test battery is a group of carefully selected tests designed to operate as a
unit.

** The full name is Stanford Revision of the Binet-Simon Scale.

The individual intelligence examination should not be admin- 
istered by a novice. To give such a test—and more important to 
interpret it—requires special training in mental measurement 
and in clinical psychology, plus a sound knowledge of psycho- 
logical theory. In addition, at least six months should be spent in 
giving and scoring these tests under supervision, if one is to have 
a minimum of "clinical experience." Unfortunately, perhaps, 
directions and materials for giving the individual scales are 
readily available in the manuals, and the beginner is tempted to 
try his hand at administering the tests. Much undeserved criticism 
of the individual intelligence test, and of the MA and IQ, has
arisen from the faulty administration and interpretation of these 
scales by the unskilled amateur. 


The Concept of General Intelligence 


Before examining the individual intelligence scales in detail, 
we must get a clearer notion of what the tests are attempting to 
measure. This means that we must formulate a definition of what 
is meant by "general intelligence." 

Definitions of general intelligence have run the gamut from 
such comprehensive biological descriptions as adjustment to the 
environment to the fairly narrow designation of aptitude for 
academic work. The French psychologist Alfred Binet defined 
intelligence as (1) the ability to take and maintain a definite 
direction—that is, to carry through a course of action once 
begun; (2) adaptability to new situations and new requirements; 
and (3) the power to evaluate and criticize one's own acts (not 
present in the feebleminded). Other psychologists agreeing in
the main with Binet have stressed adjustment to life and capacity
to learn. In contrast with these broad formulations, Lewis M.
Terman, author of the Stanford-Binet, has defined intelligence 
simply as the ability to carry on abstract thinking. 

Definitions of general intelligence must of necessity be broad 
when they stress biological adaptability to life. Such definitions 
are hardly incorrect, but neither are they useful. Indeed, any 
attempt to encompass such a comprehensive function as general 
adaptability is a well-nigh impossible task. On the other hand, a 
definition of intelligence simply as the ability to do school work 
is certainly too narrow; we should include proficiency in everyday
activities in business and the professions, where aptitude
displayed in school finds ready application.

In order to give greater precision to the concept of intelligence, 
the educational psychologist Edward L. Thorndike has suggested 
that we recognize at least three broad areas of intelligent be- 
havior. These "intelligences" he called abstract, mechanical, and 
social. Abstract intelligence he defined as the “ability to understand
and manage ideas and symbols, such as words, numbers,
chemical or physical formulas, legal decisions, scientific principles
and the like. . . .” In the case of students, this is very close
to what is called scholastic aptitude. Mechanical intelligence in- 
cludes “the ability to learn, to understand and manage things and 
mechanisms, such as a knife, a gun, a mowing machine, an auto- 
mobile, a boat, a lathe. . . .” Social intelligence is “the ability to 
understand and manage men and women, boys and girls, to act 
wisely in human relations.” We should expect to find high abstract
intelligence in scholars, scientists, executives in business
and government; high mechanical intelligence in mechanics, 
builders, expert carpenters and plumbers; and high social in- 
telligence in politicians, salespeople, leaders in society. Presum- 
ably the successful civil engineer possesses high abstract as well 
as high mechanical intelligence; the successful criminal lawyer 
abstract as well as social intelligence; the machinery salesman 
mechanical and social intelligence. These “intelligences” are 
positively, but not always highly, correlated. Hence, a high level
of one “intelligence” may accompany a fairly low degree of
another. A nuclear physicist (high in abstract intelligence) may 
be socially inept. And the man successful in business or politics 
may be mediocre in mechanical skills. Perhaps the able
jack-of-all-trades can be expected to rate well, but not necessarily very
high, in all three areas. 

On examining the individual intelligence test, we find that it 
presents a variety of problems which demand the ability to utilize 
ideas and symbols, for example, words, numbers, diagrams, 
pictures, geometrical figures. When used with young children, 
general intelligence tests are primarily measures of mental alert- 
ness on the abstract level. For adults, these tests are measures of 
the aptitude for such occupational and other tasks as draw upon 
abilities operative in school work. In short, the individual intel- 
ligence test measures abstract or scholastic ability primarily and 
is rarely a gauge of mechanical aptitude or of social competence. 
The evidence for this view comes from an analysis of the tests 
themselves, as well as from many studies in which individual 
intelligence tests have been used. 


THE STANFORD-BINET INTELLIGENCE 
SCALE (1937 REVISION) 


Because of the time required to administer the Stanford-Binet 
(in most cases forty minutes to an hour) and the training de- 
manded of the examiner, this test is rarely given routinely in 
most schools. The classroom teacher must be generally familiar 
with the Stanford-Binet, however, in order to know what can 
be expected of it—that is, how it might add to her knowledge 
of a given pupil. The Stanford-Binet is a valuable supplement to
a group intelligence test or to an educational achievement ex- 
amination when (1) a child has a severe reading disability or 
some physical handicap (for example, in sight, hearing, or mus- 
cular co-ordination); (2) when a pupil exhibits marked emotional 
stress or emotional disturbance; and (3) when other test results 
or school marks do not jibe with the teacher's estimate of the 
pupil’s ability. For purposes of routine classification and placement,
the group intelligence test is about as satisfactory as the
Stanford-Binet, but the latter will provide a more accurate, de- 
tailed, and comprehensive appraisal of intellectual level and is 
more useful in diagnosis and prediction. 


Description. The 1937 edition of the Stanford-Binet represents
a careful and thorough re-working of the earlier 1916 scale. The
number of test items was increased from 90 to 129 and the scale
extended down to lower age levels and much strengthened at
upper age levels. Two equivalent forms of the scale, called L and
M, were constructed. Table 3-1 contains a selection of the items
at different age levels. Note that at the lower age levels, such as
IV, the test situations make use of objects and pictures and require
that the child understand and carry out oral directions. At the
upper age levels, X, XIV, and Average Adult, the test items are
more abstract and bookish; the problems require verbal and
numerical manipulation, reasoning, logical selection and choice,
and good judgment. Memory for numbers and for sentences
recurs throughout the scale. Questions dealing with specific facts
learned in and out of school are excluded, but many common-knowledge
questions are included on the reasonable assumption
that what a person has learned in everyday living is a good index
of what he can learn, and will learn, later on. Some Stanford-Binet
test materials are shown in Figure 3-1 (facing page 54).


TABLE 3-1 Illustrative Tests from Stanford-Binet Scale
for Years IV, X, XIV and Average Adult

Year IV
 1. Picture Vocabulary         Child must recognize and name everyday
                               objects seen in the pictures.
 2. Naming Objects             Child is shown small toys representing common
    from Memory                objects. These he names, or they are
                               named for him. Later he must recall from
                               memory the name of each object.
 3. Picture Completion         Child must finish the incompleted drawing
                               of a man.
 4. Pictorial Identification   Pictures of objects on cards to be identified.
 5. Discrimination of Forms    Recognition and identification of simple
                               geometrical forms.
 6. Comprehension              Sensible answers to “why” questions.
 Alternate: Memory             Repetition of short sentences read aloud to
    for Sentences              the child.

Year X
 1. Vocabulary                 The examinee must give definitions of
                               eleven words in a standard vocabulary list.
 2. Picture Absurdities II     Must recognize what is “foolish” in a
                               presented picture.
 3. Reading and Report         Reads a selection and reports from memory
                               what is read.
 4. Finding Reasons            Gives sensible reasons to explain cause-and-effect
                               relations in familiar situations.
 5. Word Naming                Names as many words as he can in one
                               minute: a measure of word fluency.
 6. Repeating Six Digits       The lists are read aloud at the rate of about
                               one a second.

Year XIV
 1. Vocabulary                 Larger vocabulary required than at Year X.
 2. Induction                  Tests ability to grasp and apply a general
                               rule.
 3. Picture Absurdities III    Must recognize what is “foolish” in a picture;
                               more difficult than at Year X.
 4. Ingenuity                  Tests ability to solve problems mentally.
 5. Orientation:               Must be able to solve problems involving
    Direction I                space relations by following fairly complex
                               directions.
 6. Abstract Words II          Must define words like “loyalty” and
                               “justice.”

Average Adult
 1. Vocabulary                 Larger vocabulary than at Year XIV.
 2. Codes                      Must learn two codes and write messages
                               in them.
 3. Differences Between        Tests ability to generalize; makes use of
    Abstract Words             fairly difficult concepts.
 4. Arithmetical Reasoning     Requires solution of mental arithmetic
                               problems.
 5. Proverbs                   Interpretation of proverbs and fables.
 6. Ingenuity                  Solution of problems requiring “mental
                               manipulation.”
 7. Memory for Sentences       Tests ability to reproduce rather long and
                               involved sentences heard once.
 8. Reconciliation of          Must tell how words denoting opposite
    Opposites                  states are alike. Tests ability to grasp
                               abstract relations.


Scope. The placement of test items at a given age level was
made to depend on the responses of one hundred children at each
age level below 6, two hundred children at ages 6 to 14 inclusive,
and one hundred children at ages 15 through 18. In all, about
3,000 children constituted the standardization group.

Terman and his co-workers selected children whose parents
constituted a good cross section of occupational levels in the
United States for the year 1930. The Stanford-Binet, like the
original Binet Scale, is an age-scale. (See page 32.) It begins at
2 years and items are grouped at one-half year intervals (at 2,
2½, 3, 3½, 4, 4½) up to 5 years. Mental growth at the lower
age levels is so rapid that the authors of the scale thought it wise
to narrow the gaps between age levels over this range. From 5
years to 14, test items are grouped by year intervals; and beyond
14 there is an average adult level and three superior adult levels.
The Stanford-Binet is most useful over the age range from about
6 to 14, that is, over the elementary grades.


Scoring. The Stanford-Binet assigns a mental age (MA) to a
child in accordance with his ability to progress up the age scale.
As shown in the examples on page 51, two children may earn the
same MA on the Stanford-Binet in different ways.

James, who is 9-3 or 111 months old, earns an MA of 8-10, or
106 months, by scattering his answers up the scale from age
VII to age XIII. Robert also earns an MA of 106 months, but
does not scatter as much as James. MA is a measure of mental
maturity or status. Children differ in the way in which they
answer the test items, but by and large a child comes out with an
MA which indicates his ability to perform mental-manipulative
tasks like those of the Scale.


Test Record of James Brown, chronological age 9-3,
or 111 months*

               Tests Passed     Months Credit     Total Credit
Year Level     and Failed       Per Test          Year    Month
VII            all passed                          7
VIII           four passed      2 months                    8
IX             three passed     2 months                    6
X              two passed       2 months                    4
XI             one passed       2 months                    2
XII            one passed       2 months                    2
XIII           all failed                                   0
                                                   7       22

* The expression 9-3 means 9 years and 3 months.

MA = 8-10 or 106 months
James’ IQ = MA/CA × 100 = 106/111 × 100 = 95


Test Record of Robert Green, chronological age 8-4,
or 100 months

               Tests Passed     Months Credit     Total Credit
Year Level     and Failed       Per Test          Year    Month
VII            all passed                          7
VIII           five passed      2 months                   10
IX             four passed      2 months                    8
X              two passed       2 months                    4
XI             all failed                                   0
                                                   7       22

MA = 8-10 or 106 months
Robert’s IQ = 106/100 = 106 (decimal dropped)


The intelligence quotient, or IQ, is found by dividing the
child’s MA by his CA (chronological age) and is a measure of
brightness or dullness. James has an IQ of 106/111, or 95, and
Robert, who is 11 months younger, has an IQ of 106/100, or 106.
Both boys have the same mental maturity, but Robert is brighter
than James because he has reached the maturity level of 8-10
at an earlier age. The two measures, MA and IQ, are complementary,
each providing distinctive information. A child of 8
and a man of 40 may each earn an MA of 8 years on the Stanford-Binet
(be of the same mental status in terms of the tests). But
the child has an IQ of 100 (8/8) and is normal, whereas the man
is feebleminded, with an IQ of approximately 53. (Read from the
tables in the Manual.) *

The IQ is a developmental ratio which inevitably loses its
value as a child grows older and mental maturity is approached.
There is little difference in mean performance on the Stanford-Binet
at ages 15, 18, and 20, and a correction table is provided
in the Manual which adjusts the CA divisors in order to make the
older person’s IQ comparable to that of the child. There is no
specific age at which intelligence can be said to “mature” or
reach its peak, but 15 is taken somewhat arbitrarily to be the MA
of the average adult on Stanford-Binet. For any person over 16,
therefore, the corrected divisor is 15 years. The highest MA
which can be earned on the Stanford-Binet by passing all of the
tests in the Scale is 22¾ years (273 months). This MA yields a
maximum IQ for adults of 152, found by dividing 273 months by
180 months (that is, 15 years).
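
The arithmetic in the two test records, and the adult cases just described, can be reproduced with a short Python sketch. The function names are hypothetical. The sketch rounds the quotient to the nearest whole number, and it covers only two cases: children, whose divisor is their actual CA, and persons over 16, whose divisor is fixed at 15 years. The intermediate adjustments given in the Manual's correction table are not reproduced here.

```python
ADULT_DIVISOR_MONTHS = 15 * 12  # corrected divisor for any person over 16


def mental_age_months(basal_years, months_credit):
    """Basal year level (all tests passed) plus months of credit above it."""
    return basal_years * 12 + sum(months_credit)


def ratio_iq(ma_months, divisor_months):
    """IQ = MA / CA x 100, rounded to the nearest whole number."""
    return round(100 * ma_months / divisor_months)


# James: basal at Year VII, then 8, 6, 4, 2, 2, 0 months of credit
james_ma = mental_age_months(7, [8, 6, 4, 2, 2, 0])
# Robert: basal at Year VII, then 10, 8, 4, 0 months of credit
robert_ma = mental_age_months(7, [10, 8, 4, 0])

print(james_ma, ratio_iq(james_ma, 111))    # 106 95
print(robert_ma, ratio_iq(robert_ma, 100))  # 106 106

# A child of 8 and a man of 40, each with an MA of 8 years (96 months)
print(ratio_iq(96, 96))                     # 100
print(ratio_iq(96, ADULT_DIVISOR_MONTHS))   # 53

# Highest possible MA on the Scale: 273 months
print(ratio_iq(273, ADULT_DIVISOR_MONTHS))  # 152
```

Keeping the divisor as an explicit argument makes it plain which cases depend on the actual CA and which depend on the fixed adult value.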


STANFORD-BINET IN THE SCHOOLS 


The evaluation of pupils from their school grades or from subjective
impressions of cleverness or brightness is often quite
misleading. A teacher may describe a conscientious, amiable girl
of ten who is one year overage for grade as “bright” when her
IQ turns out to be relatively modest. Contrariwise, a rude,
inattentive youngster may be rated as “about average” or even
below average when his IQ is in reality considerably above
normal. Judgments of intelligence are always influenced by
personality traits and social behaviors. It is not strange, therefore,
to find that two pupils must in general differ by as much
as twenty points of IQ before a teacher is forced to lay other
criteria aside and admit that the badly behaved youngster is
brighter than the courteous, hardworking one.

* See Terman, L. M., and Merrill, M. A. Measuring Intelligence. New York:
Houghton Mifflin Co., 1937. Tables, pp. 415-450.

Teachers should know certain facts about the IQ, what it is
and how best it can be used, in order to make maximum use of 
the information provided by the test. More specifically, the 
classroom teacher should know (1) the range of IQ's to be ex- 
pected in the school population, (2) the dependence to be placed 
on the IQ as a measure of intelligence, (3) to what extent the 
test has diagnostic value, and (4) the limitations of the IQ and 
the precautions to be observed in making interpretations based 
on it. These topics will be considered in the following sections. 


Range of IQ's in the School Population 


The frequency polygon in Figure 3-2 shows the distribution
or spread of IQ’s for the nearly 3,000 children from 2 to 18
years old who made up the standardization sample. The frequency
polygon is close to the normal curve model (page 17).
IQ's center at 100 and range about equally above and below this
value.


FIGURE 3-2 Distribution of IQ’s on the Stanford-Binet Scale
for Nearly 3,000 Children, 2-18 Years Old

From Terman, Lewis M., and Merrill, Maud A., Measuring
Intelligence. Reproduced by permission of Houghton Mifflin
Company.


The σ of the IQ distribution is about sixteen points
(exactly 16.4). This means that the middle 2/3 of school children
will earn IQ's between 84 and 116. About 1/6 of the
children will have IQ's above 116 and 1/6 will have IQ's below
84. See Figure 2-7. The percentage of school children who can
be expected to occupy the different IQ levels may be summarized
as follows:


TABLE 3-2

Numbers of Children in the School Population to Be
Expected at Various IQ Levels

                                               Percent of Children
IQ Level          Description                  in Each Category
130 and above     Superior or gifted                 3-5
110-129           Above average to high              25
90-109            Average or normal                  45-50
70-89             Low normal to dull                 20-25
Below 70          Dull to feebleminded               2-3


The number of children found in any group (especially in the 
two extreme groups) will vary somewhat with the social and 
economic conditions of the community and with the standards 
set up for defining the different intelligence levels. 
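
The fractions quoted above for the middle two-thirds and the two outer sixths follow from the normal curve. The sketch below checks them in Python under two stated assumptions: that IQ's are exactly normally distributed, and that the mean is 100 with the rounded σ of 16. It is an idealized model and will not match the observed percentages in Table 3-2 exactly; the helper name is hypothetical.

```python
from statistics import NormalDist

# Idealized model of the IQ distribution (assumed, not observed data)
iq_curve = NormalDist(mu=100, sigma=16)


def share_between(low, high):
    """Proportion of the normal curve lying between two IQ values."""
    return iq_curve.cdf(high) - iq_curve.cdf(low)


middle = share_between(84, 116)       # within one sigma of the mean
above_116 = 1 - iq_curve.cdf(116)     # more than one sigma above
below_84 = iq_curve.cdf(84)           # more than one sigma below

print(round(middle, 3), round(above_116, 3), round(below_84, 3))
```

Under these assumptions the middle band holds about 68 per cent of cases and each tail about 16 per cent, which is the basis for the "middle 2/3" and "about 1/6" figures in the text.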

The IQ is useful in setting educational expectations. Suppose 
that William Butler, a fifth-grade pupil in a large school system, 
has a chronological age of 10-2 and a Stanford-Binet IQ of 116. 
William reads at fifth-grade level, is somewhat above average in 
his other subjects, and is excellent in arithmetic. He is a quiet, 
well-behaved boy who seldom becomes angry or annoyed.
William makes friends readily and is accepted as a member of 
his group. What are William’s educational expectations? 

Table 3-3 will be of help in answering this question. William
falls in the upper 16 per cent of school children. He should have
no trouble completing elementary and high school. If he is
industrious, emotionally stable, and has intellectual interests,
William may be encouraged to go to college. It might be wise
to advise a college of not too high standards if William is lacking
in self-confidence, and to enlist his parents’ enthusiastic support.


FIGURE 3-1 Test Materials Used in the Stanford-Binet Scale

FIGURE 3-4 Items from the Performance Part of the Wechsler
Adult Intelligence Scale


Mary, age 12-0 with an IQ of 83, presents a very different
picture from that of William. Mary is doing barely passing work
in the fifth grade, though she is about two years overage for the
grade. Since her MA is not more than 10 years,* she is perhaps
doing all we can reasonably expect of her. It would be manifestly
unfair to scold Mary and insist that she “try harder.” Mary’s
educational expectation (see Table 3-3) is no higher than the
eighth grade, if that.

* Since MA/CA = IQ, Mary's MA is 12 × .83, or about 10 years.


TABLE 3-3

Educational Expectation in Relation to IQ Level

IQ Level
(Stanford-Binet)   Educational Expectation

120 +              Can do acceptable work in a first-class college if properly
                   motivated.

115-119            Should do acceptable but not outstanding college work.
                   Would probably do best in a small college where the
                   work is individual and standards not too high.

105-114            Should complete high school, and may do well in the less
                   difficult college courses. Will have trouble with science
                   and mathematics.

90-104             This group constitutes about 50% of the elementary
                   school population. If not retarded by illness or other
                   causes should complete the eighth grade on schedule.
                   Some of these pupils will do fairly well in high school.

80-89              Usually one to two years over age for grade. Acceptable
                   high school work very unlikely for IQ's below 90. A
                   child of IQ 80 will complete the eighth grade (if at all)
                   two to three years behind schedule.

75-79              These children may reach the fifth grade. Will rarely
                   go beyond unless given much individual attention.

Below 75           If one of these children reaches the fifth grade he will be
                   14-15 years old. Unable to do fifth-grade work; but because
                   of chronological age is likely to be pushed ahead
                   after repeating each grade two or three times. May be
                   promoted because of age far beyond his mental capacity.


Stability of the IQ 


When a second form of the Stanford-Binet is administered to 
a child, this second IQ will often vary somewhat up or down 
from the first determination. Norman’s IQ, for example, may 
be 109 today, whereas it was 112 six months ago, and may very 
well be back to 106 six months hence. The stability of a test 
score when the test is repeated or another form given, is called 
the reliability of the test (page 28). Stanford-Binet is one of 
our most dependable mental examinations, with reliability co- 
efficients which are usually well over .90 (page 29). Despite 
this fact, fluctuations in individual IQ's can still be expected 
when the test is repeated or a second form administered. 

The reliability of a test is conveniently expressed by the standard
error (SE) of a test score (page 30). The SE gives the allowable
(one might almost say the inevitable) changes to be expected
when a second form of the test is given. The SE of the Stanford- 
Binet IQ is four to five points* for IQ’s between 90 and 110. 
The SE is slightly higher for high IQ's and somewhat lower for 
low IQ's. Expressed in terms of chances or probability of change, 
a SE of five points means that the odds are roughly 2:1 that an 
IQ of 102, for example, will not be higher than 107 (102 + 5) 
nor lower than 97 (102 — 5) on retest. The SE represents the 
amount of fluctuation to be expected in most cases. The change 
in a few individual cases may be somewhat greater than five 
points or somewhat less than five points. Fluctuations in IQ from
time to time arise from many causes: changes in the testing 
situation and changes in the child being tested. When a child’s 
mental or physical health or his home or school environment 
change radically between tests, fluctuations in IQ can be expected.
Mental measurement is never as precise as physical measurement:
a child is a much more variable “object” than, say, a
piece of metal. Changes in IQ from one test to another rarely
shift a child from one classification to another, however (see
Table 3-3); that is, they rarely move him from normal to superior
or from dull to normal. The consensus, in fact, is that the IQ is
extremely hard to change and that we can accept an IQ when
expertly determined as a reliable appraisal of a child’s general
mental level.

* When SE of the IQ = 16 √(1 − .90), we have 5 as the approximate value of
the SE. This is a slight overestimation, as the reliability coefficient is usually
above .90.
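
The standard error computation in the footnote, and the retest band it implies, can be sketched as follows. The σ of 16 and the reliability of .90 are the values used in the text; the function names are hypothetical, and the band of one SE on either side is the range the text associates with odds of roughly 2:1.

```python
import math


def standard_error(sd, reliability):
    """SE of a score = SD * sqrt(1 - reliability coefficient)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def retest_band(iq, se):
    """Range of one SE below to one SE above an obtained IQ."""
    return (iq - se, iq + se)


se_iq = standard_error(16, 0.90)
print(round(se_iq, 2))        # 5.06, i.e. about five points
print(retest_band(102, 5))    # (97, 107)
```

A higher reliability coefficient shrinks the SE, which is why the footnote calls five points a slight overestimate.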


The Stanford-Binet IQ in Diagnosis of Child Behavior


Children who achieve the same mental age will differ in IQ
when their CA’s differ (page 51). Furthermore, even when the
IQ is the same, two children may differ sharply in various
aspects of mental development, as shown by the sorts of tests
passed and failed and by the degree of scatter over the scale.
The Stanford-Binet is primarily a standard test-interview designed
to furnish a cross-sectional view of a child’s intellectual
capacities—that is, to give the level at which the child normally
functions. At the same time, the school psychologist, in writing
an account of a child’s performance on the test, will usually
note irregularities in development and learning ability, and these
observations provide the teacher with valuable clues to an
understanding of the child. Visual handicaps, inco-ordinations, and
other physical handicaps may be noted; so also may be noted
deficiencies in arithmetic skills, in word comprehension, in
reasoning, and in current information. The sub-tests of the
Stanford-Binet call for fairly specific performances, and are not sufficiently
numerous or comprehensive to permit the final judgment that
“John is weak in number work, but excellent in rote memory,”
or that “Sarah’s verbal facility far exceeds her manipulative
skills.” But the pattern of a child’s responses and the relative
strengths and weaknesses displayed on groups of items will
provide useful information.

Parents are often puzzled when a child who is a discipline
problem is, at the same time, described as above normal in
intelligence. The reason, of course, is that Stanford-Binet is not a
measure of social intelligence or of emotional stability but of
general verbal or abstract level (page 46). At the same time, the
observant psychologist will note and record characteristic
emotional and temperamental behavior displayed as the child takes
the test. The rude or indifferent youngster, who doesn’t care and
who doesn’t co-operate; the spoiled and petulant “brat,” who
gives up and pouts at the first failure; the timid and insecure
child, who inquires eagerly “Is that right?” after answering each
item—all these children reveal their distinctive personality traits
by the manner in which they tackle the test. Standards of
behavior in the home, ideals of conduct, values, and attitudes are
often exhibited clearly, if indirectly, in the course of a mental
examination. At Year VII, for example, is the question “What’s
the thing to do if another boy (or girl, depending on the sex of
the examinee) hits you without meaning to do it?” The child
who is immature socially or reared in a rough-and-tumble
community will answer promptly “Hit him back.” The 7-year-olds
who are better trained in acceptable social practices will qualify
their replies, or suggest that forgiveness may be in order if the
blow were truly accidental.

The following case histories will illustrate how qualitative 
analysis of a child’s test performance can help the classroom 
teacher who refers him to the psychologists. 

Case I. SM, a boy; CA = 10-2, MA = 8-2, IQ = 80. 

This boy was referred by his teacher because of unsatisfactory
work in the fourth grade. He is a good-looking, polite lad,
normal in appearance and in social manner. Anyone unacquainted
with his school work might judge SM to be average in
intelligence, or perhaps above average. On the Stanford-Binet, SM's
vocabulary was childish, with definitions in terms of use. He
passed the vocabulary test only at Year VI. His answers to the
picture and verbal absurdities were halting, poorly phrased, and
uncomprehending. He had inaccurate and meager responses to
“seeing relations” items—differences and similarities. He was
poor in number relations; his co-ordination and rote memory
were fair. SM is a dull boy who may reach the eighth grade, but
is not likely to go beyond. It was recommended that SM
undertake vocational training.


Case II. RW, a girl; CA = 11-2, MA = 11-3, IQ = 100.

RW is a well-developed girl, apparently calm and self-pos- 
sessed. She was referred by her teacher because of poor work in 
the sixth grade; she is described as being inattentive and given 
to daydreaming. RW seemed indifferent to the test, but did not 
refuse to co-operate. She often asked that a question be repeated,
and the examiner suspects slight deafness. She became more in- 
terested as the test proceeded, especially when she got the answers 
to several questions. Her vocabulary is at Year X, but her verbal 
ability is about normal, as shown by her ability to deal with 
pictures and verbal absurdities, name words, define abstract terms, 
and deal with similarities. Her attention was somewhat variable 
and she was easily distracted. She showed uncertainty in using 
number relations, as, for example, in making change. RW is 
normal in intelligence and should be able to do satisfactory work 
in the sixth grade. It is suspected that her daydreaming is, in 
part, a consequence of puberty. It was recommended that the 
classroom teacher check on RW’s friends, outside activities and 
home conditions. 


Case III. HP, a boy; CA = 6-5, MA = 9-6, IQ = 148. 

The second grade teacher is not sure what to do with HP; he
seems to know everything she is teaching. HP's father is a
prominent surgeon. This boy entered school at 6-1 and was put
in the second grade. He is well mannered, normal in play and in
social activities, and gets along well with his classmates. HP
whizzed through the tests for Years VI, VII, and VIII. His
vocabulary is at Year X. He defined an orange as "a citrus fruit,
round and yellow, comes from Florida." His co-ordination is not
up to his verbal level, but his memory and perception of
differences and likenesses are excellent. HP is a very bright youngster.
He should be ready for high school by age 12 or earlier. He
should now be in the fourth grade, if he is ready for it socially.
If promotion is not feasible, a program of outside reading and
some special attention is suggested.


Precautions to Be Taken in Interpreting the IQ


Some of the factors which may influence a child’s IQ have
been touched upon in preceding paragraphs. To what extent the
IQ is an index of “innate ability” will depend upon the co-operation
and motivation of the examinee, and upon how expertly the
test has been administered. Several conditions which may affect
the reliability of an IQ are the following:


Physical causes: Sensory defects, deafness, poor eyesight. Malnutrition 
and illness are also important. 

Examiner: The personal equation of the examiner may be crucial. Mental
test examiners who are poorly trained, have harsh and unpleasant
voices, peculiarities of manner or dress, or who are supercilious or
arrogant in their relations with the child get poor co-operation and
uncertain test results.


Testing conditions: Test results are likely to be unreliable when the 
examination room is bare, too cold or too warm, or overdecorated. 


Coaching on the tests must always be watched for, since the tests have 
been widely distributed. 


Environmental surroundings: The degree of stimulation received in the 
home, the school and the community will markedly affect the test 
performance. Children from homes broken by divorce or by drunken- 
ness will often show IQ increases of as much as twenty points after 
several months of kind treatment. On the other hand, children from 
good homes who have been transferred to a deprived and restrictive 
environment (as, for example, in war) may show sharp drops in IQ.


Because of the many factors which may affect its determina- 
tion, a Stanford-Binet IQ should not immediately be denounced 
as worthless should there be a considerable shift in a second 
rating. Instead, a drastic change in IQ should be taken as a 
challenge, and the causes ferreted out if possible. The neglected 
dull normal child when taken into a good home will often show 
an increase in measured IQ, as will a child adopted into a good 
family. By the same token, a normal child will do poor school
work if he is insecure and unhappy. It seems very unlikely that
even sharp changes in IQ reflect a real alteration in a child’s 
aptitude. At least all of the environmental factors should be con- 
sidered before this conclusion is reached. 


Constancy of the IQ over the Age Range


Suppose that Bob White, who is 7 years old, has an MA of 8 
years on Stanford-Binet and an IQ of 8/7, or 114. When Bob 
is 14 years old, his MA must be 16 years if his IQ is to remain 
constant at 114 (16/14 = 114). The IQ is a measure of bright- 
ness or dullness relative to a child’s age group. Hence, should an 
IQ fluctuate widely—as, for example, from 114 to 85 or to 140— 
the ratio MA/CA becomes valueless. We have said earlier (page 
56) that when the IQ of a child has been determined by an 
expert, it is a highly dependable index. But whether the IQ
remains constant over the years from 6 to 14 (over the elemen- 
tary school, for example) will depend, for one thing, on the way 
in which the test has been constructed. This question is appro- 
priate, therefore: “Is the Stanford-Binet so constructed as to 
make a constant IQ probable or even possible?” 

There are three conditions which an intelligence test must 
meet if the IQ, defined as the ratio, MA/CA,* is to remain 
constant over the age-scale. These are: 

1. Increased spread of MA's (larger SD's) as we go up the
age-scale.

2. Homogeneity of mental function over the age range covered
by the scale. Homogeneity means that the test measures the
same "intelligence," for example, from age 2 to age 18.

3. Zero correlation between chronological age and IQ. 

These conditions are met to a high—though not perfect—
degree by the Stanford-Binet. They are not met, even
approximately, by most group intelligence tests (page 97). Let us
examine each condition further.

1. The SD of the Stanford-Binet MA distributions increases
fairly regularly with chronological age. At Year VII, for example,
the SD of the mental age distribution is 1 year; at Year X
the SD is 1.6 years; at Year XIII it is 2.3 years; and at Year XVI
it is 2.6 years. This means that if Bob White has a CA of 7 years
and is one SD above the mean for his age (that is, at 8 years),
his IQ will be 8/7, or 114. If Bob maintains his rate of mental
growth, at age 10 his MA will be 1 SD above the mean, or at
11.6 years (10 + 1.6). Bob's IQ is now 11.6/10 or 116. At year
13, should Bob stay 1 SD above the mean, his MA will be 15.3
(13 + 2.3) and his IQ 117. And at age 16, should Bob maintain
his rate of growth, his MA should be 18.6 (16 + 2.6) and his IQ
18.6/16 or 116. Figure 3-3 shows that when a child maintains
an accelerated rate of growth, his IQ (like Bob’s) will remain
approximately constant—that is, within 2 to 3 points.


* The IQ may also be defined as a standard score (p. 39). The conditions for
IQ constancy, given above, apply only to age-scales.


FIGURE 3-3. Age-Progress Curves for the Stanford-Binet Scale


[Note that the spread of MA’s becomes greater with increasing
chronological age.]

Figure 3-3 shows that when a child is below the mean for his
age, his IQ will again remain approximately constant should he
maintain his slower rate of growth. If a child has an MA of 6 and a
CA of 7—is 1 SD below the mean for his age—his IQ will be
6/7, or about 86. Should this child maintain his slower rate of
growth, at Year XIV his MA will be 11.8 (14 − 2.2) and his IQ
will be 11.8/14, or 84. It is the increasing spread of MA’s with
increasing CA which keeps the ratio MA/CA constant within
2 to 3 points, always provided the child maintains a constant rate
of growth. (See Figure 3-3.)
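
The arithmetic of the Bob White example can be sketched as follows. The SD values are those quoted in the text. The sketch assumes, as the text's own sums do, that the mean MA at each age equals the CA. The function names are illustrative.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: 100 * MA / CA, with both ages in years."""
    return 100.0 * mental_age / chronological_age


# SD of the MA distribution (in years) at each CA, as quoted in the text.
SD_OF_MA = {7: 1.0, 10: 1.6, 13: 2.3, 16: 2.6}


def iq_at_constant_standing(z: float) -> dict:
    """IQ at each age for a child who stays z SDs from the mean MA.

    Assumes the mean MA at each age equals the CA.
    """
    return {ca: ratio_iq(ca + z * sd, ca) for ca, sd in SD_OF_MA.items()}


# Bob White stays one SD above the mean for his age.
bob = iq_at_constant_standing(1.0)
for ca, iq in bob.items():
    print(f"CA {ca:2d}: IQ {iq:.1f}")

# Range of Bob's IQ over the four ages.
spread = max(bob.values()) - min(bob.values())

# The slower child of the text: MA 6 at CA 7, MA 11.8 at CA 14.
slow_at_7 = ratio_iq(6, 7)
slow_at_14 = ratio_iq(11.8, 14)
```

The computed values agree with the whole-number IQ's in the text to within a point, and the child's IQ stays within a few points over the whole range so long as his standing in SD units is unchanged.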

2. Statistical analysis has shown that the correlation between 
successive MA levels is very high, and that the Stanford-Binet 
is measuring essentially the same “intelligence” as we go up the
age-scale.


3. When a child reaches the upper teens, mental growth
changes as shown by the Stanford-Binet fail to keep pace with
chronological age. When this happens, the curves in Figure 3-3
lose altitude and bend over to become parallel with the baseline.
Failure of the MA to increase with CA leads inevitably to a falling
IQ among older children and, if uncorrected, to a negative
correlation between CA and IQ. (Negative correlation follows
because the CA continues to increase, whereas the IQ no longer
does—see page 52.) To overcome this fault in the age-scale,
the authors of Stanford-Binet provide a steadily decreasing CA
divisor from age 13 and above. This procedure bolsters up the
IQ by lessening the denominator (CA) and thus balancing the
decreasing numerator (MA). This means that a child's IQ does
not have to decrease as the child grows older—and that there is
no systematic correlation (positive or negative) between CA
and IQ.


THE WECHSLER-BELLEVUE 
INTELLIGENCE SCALE* 


Description. The Stanford-Binet is sometimes used to measure
the intelligence of adolescents and young adults, but it is not well
suited to these groups, since the items of the test were selected to
appeal primarily to children. A better examination for measuring
adult intelligence is the Wechsler-Bellevue Intelligence Scale, an
individual intelligence test designed especially for adults. The
Wechsler-Bellevue is, on the whole, a well-made examination.
The group used in standardizing the test battery—that is, the
group upon whose answers the scoring and norms depend—
consisted of about 1700 persons chosen from a larger group of
3500. The sample was chosen to represent the occupational
distribution of the adult population at the time of the 1930 census.
The sample is adequate in size, but the fact that it was drawn
mostly from New York City and New York State renders
questionable its claim to represent the country as a whole.


* The Wechsler Adult Intelligence Scale (WAIS), published in 1955,
represents a revision of the Wechsler-Bellevue Intelligence Scale
(W-BIS). WAIS makes use of the same principles of construction, scoring and
IQ derivation found in the older scale, and the two are essentially the same test.
W-BIS is described here rather than WAIS because it is better known and is
still widely used.

The Wechsler-Bellevue Scale consists of two parts, a Verbal 
Scale and a Performance Scale. Language is required in the first 
scale, but the tests in the second part demand no language in the 
actual solution of the problems. Directions, however, are given 
orally. What is called the Full Scale is a combination of the
Verbal and Performance sections. The Verbal Scale is made up
of five tests, as follows:


VERBAL SCALE 


1. General Information: Twenty-five questions covering a wide range of
common information and dealing with facts which all normal adults
have presumably had a chance to learn. Questions are graded in
difficulty from easy to hard.

2. General Comprehension: Ten questions and two alternates, in each
of which the examinee is asked to tell what should be done in certain
situations, or why certain practices should be followed. The questions
are planned to measure practical judgment, common sense, and
understanding.

3. Arithmetic Reasoning: Ten mental arithmetic problems. Each problem
is presented orally and must be solved without the use of paper or
pencil (“in the head”).

4. Digits Forward and Backward: Memory span for digits presented one
at a time and ranging in number from 3 to 9. In the second part of
the test, examinee must give the list of numbers in reverse order.

5. Similarities: Twelve word-pairs, each pair alike in some way. The
examinee must say in what way the two words are alike.

6. Vocabulary (Alternate): A list of forty-two words graded in difficulty
to be defined orally.


There are five tests in the Performance Scale, as follows: 


PERFORMANCE SCALE 

1. Picture Completion: Fifteen cards, each containing a picture from
which some part is missing. The examinee must give the missing part.

2. Picture Arrangement: Six sets of pictures, each set containing from 
three to six separate pictures. The examinee is to arrange the pictures 
in any given set so that they tell a story. 

3. Object Assembly: Three form-boards—the Manikin, the Profile, and 
the Hand. The parts of each form-board must be put together, much 
as in a jigsaw puzzle, to form a complete object. 

4. Block Design: Sixteen small cubes (blocks) colored red, white, and 
red-and-white on the sides. The blocks are to be arranged to match 
seven designs presented on test cards. The designs require from four 
to sixteen cubes.

5. Digit Symbol: A well-known association test. Nine numbers are 
matched with nine symbols in accordance with a key. 


Samples of the items from the performance part of the 
Wechsler Adult Intelligence Scale are shown in Figure 3-4 
(facing page 54). These tests are “performance” in the sense 
that the examinee in solving the problem must make use of 
diagrams, pictures, form boards, and cubes. But “ideas”—that 
is, symbols—are certainly not excluded. Wechsler's performance 
tests, therefore, are measuring abstract rather than motor or 
mechanical intelligence. 

Scope. The Wechsler-Bellevue Scale provides scores in the 
form of “IQ's.” Norms run as low as 10 years, but the scale's 
principal application is over the age range from about 20 to 60 
years. Beyond 60 years, Wechsler-Bellevue IQ's are not always 
dependable, owing in part to the small samples at advanced age 
levels. But these IQ's may be taken as useful estimates of general 
intelligence. Age-level scores on the Full Scale (Verbal + Performance)
show a gradual decline after 20, the drop in score
from age 20 to age 60 being about 20 per cent.


Scoring. Following the directions given in the scoring guide 
(Manual), the examiner first adds up the items done correctly
(speed is sometimes a factor) for each of the ten sub-tests. Scores
on each sub-test are then converted into standard scores (page 
38), in which the mean for the 20-34 age group is set at 10 
and the SD at 3. Conversion of the separate sub-test scores 
into a common standard score scale allows the examiner to com- 
bine the tests into a single index and thus to compare the ups and 
downs in performance from sub-test to sub-test. 
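
The conversion to a common scale can be sketched as a linear standard score. This is only an illustration: the examiner reads the actual equivalents from tables in the Manual, and the reference-group means and SD's below are hypothetical, as are the function and variable names.

```python
def scaled_score(raw: float, ref_mean: float, ref_sd: float) -> float:
    """Linear standard score with mean 10 and SD 3 in the reference group."""
    return 10.0 + 3.0 * (raw - ref_mean) / ref_sd


# Hypothetical reference-group (mean, SD) of raw scores for two sub-tests.
# These are NOT figures from the Manual.
REFERENCE = {"Information": (14.0, 4.0), "Block Design": (22.0, 8.0)}

# Hypothetical raw scores for one examinee.
raw_scores = {"Information": 18.0, "Block Design": 14.0}

# Once both sub-tests are on the same scale they can be compared directly.
profile = {
    name: scaled_score(raw_scores[name], *REFERENCE[name])
    for name in raw_scores
}
print(profile)
```

On the common scale this examinee stands one SD above the reference mean in Information and one SD below it in Block Design, a comparison the raw scores alone would not permit.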

The Wechsler-Bellevue does not provide mental ages, since 
the concept of mental age, though useful with children, has little 
meaning when applied to normal adults. The Scale does provide 
for an IQ (called a “deviation IQ”), which is essentially a standard
score. There are three IQ's, one from the Verbal, one from
the Performance, and one from the Full (combined) Scale. In
each case these deviation IQ's are found in the following way:
Scores on the sub-tests (10 for the Full Scale) are added and the
total is converted again into a standard score, this time with a
Mean = 100 and a SD = 15 (page 39). At each age level (for
example, at 30, 40, 50) the mean score got from the sub-tests is
set at IQ 100. A score which is one SD above the mean at any
age level then becomes an IQ of 115. Putting the mean IQ for each age
level at 100 adjusts for the steady fall in total test score with age.
A given standard score IQ or “deviation IQ” below 100 denotes the
same degree of retardation with reference to one's age group,
whatever that age group may be.
For example, we read from the Manual that a man aged 35 who
achieves a score of 75 on the 10 tests of the Full Scale has an IQ
of 92—is slightly below the mean of his age group. The same
score of 75 becomes an IQ of 96 at age 45 and an IQ of 100 at
age 60. This means that a total score of 75 on the 10 sub-tests is
“normal” (or “at age”) for age 60 and hence receives an IQ of
100. But the score of 75 is below the mean for the younger 
groups. Again, the examinee who earns a score of 90 (Full Scale) 
has an IQ of 109 if he is 57, an IQ of 102 if he is 37, and an IQ 
of 94 if he is 22 years old. 
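
The principle of the deviation IQ can be sketched as follows. The age-group means and SD's are invented for illustration and will not reproduce the Manual's figures quoted above. Only the Mean = 100 and SD = 15 of the IQ scale come from the text.

```python
def deviation_iq(total: float, age_mean: float, age_sd: float) -> float:
    """Standard score with mean 100 and SD 15 within the examinee's age group."""
    return 100.0 + 15.0 * (total - age_mean) / age_sd


# Hypothetical (mean, SD) of the Full Scale total score in two age groups.
# The older group is given the lower mean, reflecting the decline with age.
NORMS = {"younger": (95.0, 25.0), "older": (75.0, 25.0)}

same_total = 75.0
iq_younger = deviation_iq(same_total, *NORMS["younger"])
iq_older = deviation_iq(same_total, *NORMS["older"])

# One SD above the mean of any age group is an IQ of 115.
one_sd_up = deviation_iq(75.0 + 25.0, *NORMS["older"])

print(iq_younger, iq_older, one_sd_up)
```

The same total score earns a lower IQ in the younger group because it falls below that group's mean, while it is exactly "at age" for the older group.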

To summarize, Wechsler-Bellevue Scale IQ's are converted,
or standard, scores in which the mean is always 100 for each age
group and the SD is 15. Wechsler-Bellevue IQ's have the same
meaning from one age to another in the sense that an IQ of 105
or of 86 implies the same relative superiority or inferiority to the
examinee's age group. The Wechsler-Bellevue IQ is a standard
score, whereas the Stanford-Binet IQ is a ratio, MA/CA. The
two indices are highly correlated, but are not equivalent. To
avoid confusion, it helps to write “Wechsler-Bellevue IQ” whenever
the deviation IQ is meant. Both the Wechsler-Bellevue IQ
and the Stanford-Binet IQ are measures of abstract intelligence
(page 46).


THE WECHSLER-BELLEVUE SCALE 
IN THE SCHOOLS 


The Wechsler-Bellevue Scale has been widely used in the
individual study of adolescents and older students for whom the
content of the Stanford-Binet is inappropriate. The test is most 
valuable, therefore, to teachers in the upper grades and in high 
schools and technical schools. 


Range and Stability of Wechsler-Bellevue IQ's 


The range of IQ's in the general school population is about the 
same for the Wechsler-Bellevue Scale as for the Stanford-Binet. 
Table 3-3 will serve, therefore, as a guide in the interpretation of 
a test score. Table 3-3 may be taken also as providing a statement 
of the educational expectations of older students when we know 
the Wechsler-Bellevue IQ. The reliability of the Wechsler- 
Bellevue Scale, as given by its standard error, is approximately 
five points. Hence, the IQ from this scale has about the same 
stability as the Stanford-Binet IQ. 


The Wechsler-Bellevue Scale in Diagnosis 


The Full Scale—like the Stanford-Binet—yields a measure of 
a student’s general mental level and is often used to provide this 
information. The Wechsler-Bellevue, however, has also been
widely employed in mental hospitals and clinics for the diagnosis
of abnormal behavior. The Scale has been useful in the study of 
variations of performance in schizophrenia and other mental dis- 
eases, in senile deterioration, and in assessing the effects of brain 
damage and the results of brain surgery. The fact that there are 
eleven separate tests (six Verbal and five Performance) in the 
Full Scale has led clinical psychologists to attempt to discover the 
relative efficiency of various mental functions from irregularities 
in test performance. 

The diagnosis of differential abilities (strengths and weaknesses)
from the sub-tests of the Wechsler-Bellevue must always
be taken as tentative, though an examination of the different
sub-tests may provide valuable clues. The various tests of the Scale
are too short and too complex (in that they test overlapping
abilities) to allow a sweeping judgment to the effect that “Bill
has poor planning capacity and poor judgment” or that “Mary
has a good memory and adequate concentration.” Observations
of this sort are valuable only if made cautiously and taken in
conjunction with other evidence. The Full Scale is a good index of
present mental efficiency, and the difference between the Verbal
and Performance IQ’s may be significant of the academic vs. the
non-academic “mind” (page 75). But judgments drawn from
specific sub-tests with respect to strengths and weaknesses in
memory, learning, perception, planning capacity, concentration,
emotional blocks, and the like must be taken as suggestive rather
than conclusive.


WECHSLER INTELLIGENCE SCALE 
FOR CHILDREN 

Description. The WISC, as it is called, is a downward revision
of the older Wechsler-Bellevue to render the test more suitable
for young children. There are ten sub-tests and two alternates
(twelve in all) in the WISC. The sub-tests have the same form
and cover the same content as the Wechsler-Bellevue, except
that easier items have been added. Tests are grouped into five
Verbal and five Performance as follows:


Verbal Scale                 Performance Scale
General Information          Picture Completion
General Comprehension        Picture Arrangement
Arithmetic                   Block Design
Similarities                 Object Assembly
Vocabulary (Digit Span)      Coding (or Mazes)


The Wechsler Intelligence Scale for Children differs in several 
respects from the Wechsler-Bellevue. In the Verbal Scale, 
Digit-Span proved to be less satisfactory than the other tests and 
hence became an alternate, Vocabulary being substituted. In 
the Performance Scale, coding is a somewhat easier version of the 
Digit-Symbol test. Mazes are sometimes given instead of coding, 
but coding is usually preferred, since it takes less time
to administer. The maze test is the only test not found in the 
Wechsler-Bellevue. 


Scope. The WISC is a better made test than the Wechsler- 
Bellevue. To provide norms, one hundred boys and one hundred 
girls were tested at each age level from 5 to 15. Children in the 
standardization sample were drawn from eleven states and from 
three institutions for the feebleminded. The sample was carefully
checked to give a cross section of geographic areas, urban-rural
groups, and occupational levels of parents.


Scoring. As was true of the Wechsler-Bellevue, all sub-tests 
were first converted into standard scores in a distribution with 
M = 10 and SD = 3. Tables are provided for reading scale
score equivalents to raw scores for each 4-month period from 5
to 15 years. These equally weighted sub-test scores are added
and then again converted into “deviation IQ's,” with Mean =
100 and SD = 15 (page 39). Verbal, Performance, and Full
Scale IQ's may be read from appropriate tables in the Manual.
Approximately 50 per cent of school children can be expected
to earn WISC IQ’s between 90 and 110.
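
The figure of about 50 per cent can be checked with a short sketch. It assumes that WISC IQ's are normally distributed, which the text does not state outright. The Mean of 100 and SD of 15 are taken from the text, and the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed normal distribution of WISC deviation IQ's.
wisc = NormalDist(mu=100, sigma=15)


def percent_between(low: float, high: float) -> float:
    """Per cent of the assumed distribution falling between two IQ's."""
    return 100.0 * (wisc.cdf(high) - wisc.cdf(low))


middle_band = percent_between(90, 110)
print(round(middle_band, 1))
```

Under this assumption the band from 90 to 110 holds very nearly half of the distribution, in line with the statement above.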


Differences Between the WISC and the Stanford-Binet. The 
WISC differs from the Stanford-Binet in several important respects.
First, all items of a given sort in the WISC are organized
into sub-tests instead of different kinds of items being placed at 
successive age levels. WISC is a point scale rather than an age 
scale. Second, the WISC IQ is a deviation IQ—a standard score 
in a distribution with Mean = 100 and SD = 15—whereas the 
Stanford-Binet IQ is a developmental ratio or MA/CA. The two 
IQ's are closely related (the correlations between the two sorts of 
scores run from .80 upward), but they are not identical (page 
66). The SD of the Stanford-Binet distribution of IQ's is 16, as 
against the WISC SD of 15; and some of the difference between 
the two IQ's is due to the greater spread of the Stanford-Binet 
IQ's. Furthermore, the two mental examinations differ in length, 
variety and difficulty of items. Finally, the WISC provides for 
three IQ's—a Verbal, a Performance and a Full Scale. There is
only one IQ from the Stanford-Binet, based upon all of the tests
in the scale.


THE WISC IN THE SCHOOLS 


Both the WISC and the Stanford-Binet are widely used with 
school children, and in most cases there is little to choose between 
the two examinations. Many psychologists regard the Stanford- 
Binet as more satisfactory for use with very young children, 
since the WISC is not always easy to administer when the child 
is under seven years old. WISC takes less time to give and to
score than does Stanford-Binet, and some examiners prefer it 
over the age range of the elementary school. The WISC Full 
Scale IQ has a higher correlation with Stanford-Binet IQ than 
does either the Verbal or the Performance Scale IQ. 

Bright children tend to score higher on the Stanford-Binet than
on the WISC, whereas dull pupils score higher on the WISC.

The separate IQ’s of the WISC (Verbal and Performance) are
often valuable in bringing out differences in verbal and
manipulative skills. Sometimes a child (usually a boy) will do better on
the performance tests of the WISC than on the verbal, indicating,
perhaps, greater aptitude for vocational than for academic sub- 
jects. A bookish youngster who reads a great deal may do much 
better on the verbal tests. The performance IQ is usually higher 
than the verbal in severely disturbed adolescents, and this differ- 
ence often appears also in younger dull students. From the 
manner in which the child handles the verbal tests, the expert 
examiner will often note evidences of insecurity as revealed by 
incoherence, verbosity, poor attention, and defeatism. Poor 
performance on the manipulative tests often reveals inept plan- 
ning and defective co-ordination, whereas good performance 
shows concentration and adequate sensory-motor organization. 


Range and Stability of the WISC IQ's 

The range of WISC Full Scale IQ's to be expected in the
general school population, and the meaning of these "scores," are
shown in Table 3-4.


TABLE 3-4
Intelligence Classification for WISC IQ's

IQ Ranges       Classification       Percent in Each Group
130 and above   very superior         2
120-129         superior              7
110-119         bright normal        16
90-109          average              50
80-89           dull normal          16
70-79           borderline
69 and below    mental defective

It will be seen that these classifications correspond closely to those
for Stanford-Binet IQ's.

Reliability coefficients for the WISC are generally above .90.
They are higher for the Full Scale than for either the Verbal or
Performance Scales. The standard error of a WISC IQ is 4-5
points.


MA's from the WISC 


The WISC does not ordinarily make use of mental age, but
when mental ages are required for clinical or for legal reasons,
they can readily be determined. The Manual (Appendix E)
provides a table of "test-age equivalents to WISC raw scores." By
reference to the table, we find the chronological age of a child
for whom a given raw score is typical (or average), and this is
the MA corresponding to the score. For example, a score of 12
on the Comprehension test is achieved on the average by children
who are 10-6 years old. Hence, a score of 12 in Comprehension
has an equivalent MA of 10-6. The mean of the sub-test MA’s is
computed (Mean Test-Age Method) or the median of the MA’s
(Median Test-Age Method). Either of these determinations gives
the final over-all MA. A closely equivalent method for determining
MA’s from the WISC is by use of the formula MA = IQ × CA,
the IQ being taken as a ratio (1.10 rather than 110). A child who
achieves an IQ of 110 and who is 8-2 years old
has a MA of 1.10 × 8-2, or approximately 9-0 years.
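
The two test-age methods and the formula MA = IQ × CA can be sketched as follows. Ages are written in the text's years-months notation and handled in months. The list of sub-test test ages is hypothetical (only the 10-6 for Comprehension echoes the text's example), and the helper names are illustrative.

```python
from statistics import mean, median


def to_months(age: str) -> int:
    """Convert 'years-months' notation, e.g. '10-6', to months."""
    years, months = age.split("-")
    return 12 * int(years) + int(months)


def to_age(months: float) -> str:
    """Convert months back to 'years-months', rounded to the nearest month."""
    total = round(months)
    return f"{total // 12}-{total % 12}"


def ma_from_iq(iq: float, ca: str) -> str:
    """MA = IQ x CA, with the IQ taken as a ratio (110 becomes 1.10)."""
    return to_age(iq / 100.0 * to_months(ca))


# Hypothetical test ages for five sub-tests of one child.
test_ages = ["10-6", "9-2", "8-10", "9-6", "9-0"]
in_months = [to_months(a) for a in test_ages]

mean_ma = to_age(mean(in_months))      # Mean Test-Age Method
median_ma = to_age(median(in_months))  # Median Test-Age Method

# The text's example: IQ 110 at CA 8-2.
example_ma = ma_from_iq(110, "8-2")
print(mean_ma, median_ma, example_ma)
```

The mean and median need not agree exactly. One unusually high or low test age pulls the mean more than the median.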


PERFORMANCE TESTS


Development of Performance Tests


Performance tests designed to measure general mental ability 
have been often used in the schools (1) as substitutes for the 
more verbal tests, and (2) as supplements to the Stanford-Binet 
and other linguistic scales. Performance and non-language tests 
must of necessity be employed with pre-school children and with 
the very dull. Such tests are useful additions to the Stanford- 
Binet or WISC in the mental examination of children with speech 
and language defects or children with visual and auditory
impairment. Batteries of performance tests have long been used in
psychological clinics and in institutions for the feebleminded.
The classroom teacher should know about performance tests, 
though he will encounter them much less often than the WISC 
or the Stanford-Binet. 

The Pintner-Paterson Scale of Performance Tests (1917) was 
the first organized battery of manipulative and non-language 
tests. Widely used for many years, this scale has now been
replaced to a considerable degree by other batteries, based upon
it. The Pintner-Paterson Scale consists of fifteen separate tests.
The ten tests most often used (in what is called the Shorter
Scale) include four form boards, three picture completion tests 
(of the jigsaw puzzle type), two object assembly tests, and 
one block-counting test. 

Later performance scales are the Cornell-Coxe Performance 
Ability Scale (1934) and the Arthur Point Scale of Performance 
Tests (1930, revised 1947). These test batteries draw heavily on 
the Pintner-Paterson, but include, too, important additions and 
revisions. In addition to these test batteries, there are a number 
of other performance tests, of which a graduated series of mazes, 
the Porteus Mazes, is the best known. Widely used types of per- 
formance tests are the object assembly (page 65), various form 
boards, block counting, and block design. Two of these, block 
design and object assembly, are found in the Wechsler-Bellevue 
Scale. 

Norms are generally available for the individual performance
tests, so that one may use one or more tests without having to
administer the whole scale.


The Arthur Scale 

The Arthur Point Scale has been widely used over the age 
range of the elementary school. It is made up of performance
tests taken from various sources; it was first published in 1930 
and revised later in 1947. The later edition is a considerable im- 
provement over the original insofar as standardization is con- 
cerned, and the Scale is a good example of a performance battery 
designed for children. Figure 3-5 (facing page 55) shows the five
tests in the Arthur Point Scale.


There are five tests in the Arthur Point Scale, as follows: 


Knox Cube. The four cubes (see Figure 3-5) are tapped in a 
certain order by the examiner, for example, cube 1, cube 4, 
cube 3, cube 3. The child is told to imitate the tapping order. 
Tapping sequences become longer and more complex, until 
they can no longer be done by the child. 

Seguin Form Board. Ten common geometric forms (Figure 3-5)
are to be fitted into the right apertures in the board.

Porteus Mazes. The child is told to trace the shortest path from 
the entrance to the exit in a maze, not lifting the pencil from 
the paper. If he makes an error by crossing a line or entering 
a wrong pathway, he is stopped and given a second trial. Mazes
increase in difficulty from 3-year level to adult. 

Healy Picture Completion II. As shown in Figure 3-5, the test
shows successive scenes in a boy’s life during a typical school
day. Small pieces or blocks have been cut out of the scene. The
child must select the appropriate pieces from the box and fit
them in place.


Arthur Stencil Design Test. The child must reproduce designs 
of increasing complexity. Standard designs to be copied are 
presented on cards. Each design can be reproduced by fitting 
together stencils in different colors on a solid white card. 
Several stencils are needed for the more detailed designs. 


Scope. The Arthur Performance Scale covers an age range
from about 4 years to maturity, but is used chiefly with younger
children. The Scale is employed mainly as a clinical test
supplementary to, or as a substitute for, the Stanford-Binet.


Scoring. Scores on the sub-tests are first converted into point
scores . . . a score of 31 points becomes an MA of 10-0. MA is
divided by CA to give a “performance IQ.” These
IQ’s are not equivalent to the IQ's from verbal intelligence 
scales and are not to be so regarded. Arthur Scale IQ’s should 
always be described as “Arthur Scale IQ’s.” 


Performance Tests in the Schools 


Correlations between scores on the Arthur Scale and the 
Stanford-Binet are fairly high (.50 or more). The two tests are, 
however, not measuring exactly the same functions, and hence 
the Arthur IQ is often used as a "performance supplement” to 
the Stanford-Binet IQ. Arthur IQ's are higher than Stanford- 
Binet IQ's when the latter are low, that is, below 90; and this 
discrepancy is especially striking when children are very dull.
There is evidence that low performance test scores may be in- 
dicative of behavior problems and of emotional instability. This 
result probably grows out of the disturbed child’s poor attention 
span, poor perception of relations, and ineptitude in manual
activities. Emotional involvement may take expression in bizarre and
unusual responses. 

For the classroom teacher the main value of a performance 
test lies, perhaps, in the fact that such tests (1) may reflect poor 
language development or lack of language training, and are (2) 
often indicative of cultural and educational handicaps. As 
pointed out on page 68, a comparison with verbal tests often 
reveals, for instance, children whose manual and manipulative 
skills (“concrete intelligence”) run ahead of their verbal facility 
("abstract intelligence"). Performance tests serve, too, to identify 
the shy and inarticulate child who is brighter than the verbal 
tests show. Performance tests are not especially useful with 
normal school children over 12 years of age and they rarely 
differentiate significantly among older bright children. 


Case Histories. The following brief case histories will illustrate 
how performance tests, when used together with verbal tests, 
may provide a better understanding of a pupil's capabilities. 
Recommendations in most cases must be tentative and subject to
possible revision in the light of further information. 


Case I. Donald B.: age, 10-2; Stanford-Binet IQ, 92; Arthur
Scale IQ, 106.

Donald is doing poor work in the fifth grade. His father is a 
barber, his mother a clerk (part-time) in a store; neither parent
went beyond the seventh grade. There are three other children 
in the family, all younger than Donald. There are few books in 
the home, but the family owns a TV set and a new automobile. 
Donald reads the sports page and the comics in the daily news- 
paper, but little else. He talks in brief sentences and is generally 
unresponsive in school. He is a well-grown boy for his age, a 
good athlete, and is well accepted by his classmates. He has 
never been a behavior problem. 

Recommendation: Donald’s performance IQ is fourteen points
higher than his Stanford-Binet IQ. In view of his relatively 
meager abstract intelligence, this boy is probably doing as well 
as we can expect. He may get to high school, but will almost 
certainly not complete more than one year. Vocational training 
seems to be indicated. He will continue to have trouble with 
verbal subjects, but may be very successful at a skilled trade. 

Case II. Joan M.: age, 8-3; Stanford-Binet IQ, 126; Arthur Scale
IQ, 109.

Joan is doing excellent work in the fourth grade. Her problem
is social rather than scholastic. Her father is dead, and her mother,
a widow, is a successful dress designer. Joan is an only child and
is alone much of the time. She reads a great deal but has few close
friends and is often left out of class activities. She has a tendency
to . . . and is shy and withdrawn.

Recommendation: Joan’s low performance IQ, compared with
her high Stanford-Binet IQ, reflects a lack of experience in
“concrete” activities, such as running, playing out-of-doors,
skipping rope, dancing, and the like. This lack of opportunity to
develop manual skills is often found in children living in the
city. Joan’s mother may be encouraged to meet with other
classroom mothers, arrange parties, and invite Joan’s classmates to
her home. The aid of the physical education teacher in getting 
Joan into games should also be sought. The classroom teacher 
can often see to it, by suggestion and indirection, that Joan is 
included in class parties and out-of-class activities. 


Case III. Bob W.: age, 9-2; Stanford-Binet IQ, 104; Arthur Scale 
IQ, 115. 

Bob is doing satisfactory work in the fourth grade. He is shy 
and timid, with a slight tendency to stammer, especially when 
questioned. He is one of four children, the other three being girls
all older than he. Bob's father is a successful lawyer, and his 
mother is a college graduate and a prominent club woman. The 
parents have decided that Bob, as the only boy, is to be a pro- 
fessional man, preferably a physician (his grandfather was a well- 
known surgeon). They are dissatisfied with Bob's marks, and 
are sure he is intelligent and that the teacher is to blame. 

Recommendation: Bob is clearly a normal boy. He is not 
bright, though he probably is brighter than his Stanford-Binet 
IQ indicates. The parents must somehow be reconciled to the 
fact that (1) Bob is not of professional caliber, and (2) a lower 
vocational goal (one within Bob's intellectual grasp) will make
for a happier boy and probably a much happier life. They must 
be urged not to scold the boy and thus make him feel more 
inferior than he already does. This is a difficult problem, because 
it is the parents—not the child—who have to be "sold" on a 
different program from the one they have planned. 


SUGGESTIONS FOR FURTHER READING 


General:

Anastasi, A. Psychological Testing. New York: Macmillan, 1954.

Cronbach, L. J. Essentials of Psychological Testing. New York:
Harper.

Freeman, F. S. Theory and Practice of Psychological Testing (Rev.
Edition). New York: Holt, 1955.

Specific:

Arthur, G. A. A Point Scale of Performance Tests. Revised Form II.
Manual for Administering and Scoring the Tests. New York:
Psychological Corp., 1947.

McNemar, Q. The Revision of the Stanford-Binet Scale: An Analysis
of the Standardization Data. Boston: Houghton Mifflin, 1942.

Terman, L. M., and Merrill, M. A. Measuring Intelligence. Boston:
Houghton Mifflin, 1937.

Wechsler, D. The Measurement of Adult Intelligence (3rd edition).
Baltimore: Williams and Wilkins, 1944.

Wechsler, D. Wechsler Intelligence Scale for Children. Manual. New
York: Psychological Corp., 1949.

SUGGESTIONS FOR LABORATORY WORK 


1. Examine the Stanford-Binet items at ages 4, 8, 12, and Superior
Adult. Classify the items at each age level as verbal, numerical, spatial- 
perceptual (for example, mazes and the like), and performance (manip- 
ulative). Add other categories if you need them. Which category has 
the largest number of items? 


2. Have members of the class pair off and test each other. Be sure to 
follow the Manual carefully. Results from this "test" will not be
indicative of mental ability, to be sure, but following the procedure is a good
way to learn about the test. 


3. Repeat (1) and (2) for the Wechsler Intelligence Scale for Children. 
For (1), sample the items of each test.

4. Go over the Manual of the Arthur Point Scale. If materials are 
available, administer the Scale to a child before the class. 


QUESTIONS FOR DISCUSSION 


1. What importance do you attach to the fact that test items in
Stanford-Binet become more “verbal” as we go up the age scale?

2. Which test, Stanford-Binet or Arthur Point Scale, would you expect
to prove more effective in the following situations:
a) selecting children for a special class for the gifted
b) selecting children for remedial work in a "slow" class
c) studying children with reading problems
d) testing children with speech defects

3. A child taken from public school and entered in a private school
is reported by his mother to have shown an increase in IQ of 20 points
after six months in the “new” school. Assuming the story to be true,
what is misleading about it? What might account for the change in the
IQ?
4. A high school boy of 16 has a Wechsler-Bellevue IQ of 132. What 
advice would be justified by this fact alone? 

5. Look over the items in the WISC. Which do you think depend
primarily on schooling? Do the same for the Stanford-Binet. Which test
is the more “school centered”?

6. Terman states that the vocabulary test gives the closest approxima- 
tion to total performance on Stanford-Binet. What does this tell us 
about the nature of the Stanford-Binet IQ? 

7. In deploring the reading interests, TV programs, and voting habits 
of the American adult, critics have said that the average mental age of 
the adult is about 14 (sometimes this is 12 or 15). What does mental age 
signify here, if anything definable? 

8. Does a child with an IQ of 80 possess 80 per cent of normal
intelligence? Explain your answer.


CHAPTER 4


GROUP TESTS OF INTELLIGENCE 


Group and Individual Tests of Intelligence 


Group tests of intelligence are much like individual tests except 
that (1) they are administered like school examinations, and (2) 
they are objective in form—are answered by checking or circling 
a number or letter, or by marking one of several possible
responses. Group tests contain both verbal and non-verbal
materials. Items of the first sort are expressed in words and numbers;
non-verbal test items, on the other hand, consist of problems 
presented in pictures and diagrams. There is a minimum of 
language and little or no reading required in non-verbal items. 
Intelligence tests for pre-school and first-grade pupils are of 
necessity non-verbal, though directions are given orally.
Intelligence tests in the elementary grades contain both verbal and
non-verbal items. At the high school and college levels, test items are
mostly verbal, mathematical, and abstract, but even here many 
problems are presented in pictorial and spatial forms. 

Group tests of intelligence confront the examinee with tasks 
like those found in the individual intelligence scales. Both types 
of test minimize routine school learning and emphasize mental 
alertness by presenting problems which demand reasoning, gen- 
eralization, and the manipulation of “ideas.” But there are differ- 
ences, too, between the two sorts of test. In individual intelli- 
gence scales, questions are stated orally and are answered orally; 
moreover, problems are presented one at a time without time 
limit, or a generous limit is allowed. In group intelligence tests, 
questions are printed in a booklet, time limits are fixed, and 
answers are limited to the options provided. The group test is 
more dependent on reading than is the individual test, it is less 
flexible in response, and it is often disturbing to children who are 
easily flustered by a time limit. When a child’s school work
and/or the teacher’s opinion of his abilities do not jibe with his 
group test score, it may be advisable to check the group test 
result against the Stanford-Binet. Group tests, like individual 
scales, are concerned almost entirely with the abstract level of
intelligence (page 46).

The first group tests to be widely used were the two intelligence
examinations developed for use in the army during World
War I (1917-1918). Army Alpha consisted of eight sub-tests:
Following Directions, Arithmetic Problems, Best Answers,
Disarranged Sentences, Same-Opposites, Number Series Completion,
Analogies, and Information. Army Beta made use of diagrams
and pictures, and directions were given in pantomime. During
World War II, the Army General Classification Test (AGCT)
was developed as a measure of general ability. Unlike Alpha,
the items in AGCT were not grouped into sub-tests but were
printed in ascending order of difficulty. A civilian edition of
AGCT is now available.
REPRESENTATIVE GROUP TESTS OF
INTELLIGENCE 


This section will describe several tests of general intelligence 
covering the age range from pre-school to college. These test 
batteries have been chosen for illustration because they are well
standardized, are widely used in the schools, and are representa- 
tive of a large assortment of group tests designed to measure 
general ability. They are not necessarily the best mental examina- 
tions for every testing situation nor for every school. Selection 
of a “best” test will depend on the objectives which the school 
hopes to achieve, the time available for testing, and the money 
and personnel which the school has available. 


GROUP TESTS OF INTELLIGENCE

Pintner-Cunningham Primary Test
California Test of Mental Maturity
Otis Quick-Scoring Mental Ability Tests
Kuhlmann-Anderson Intelligence Tests
Terman-McNemar Test of Mental Ability
American Council on Education Psychological Examination


1. The Pintner-Cunningham Primary Test*


Description. This test includes seven non-verbal sub-tests
described as follows:

1. Common Observation: Child marks all of the objects in a
given set which fit into some category. (See Figure 4-1, row 1.)

2. Aesthetic Differences: The child is told to mark the
“prettiest” (that is, best) of three drawings of the same object.
(Figure 4-1, row 2.)

3. Associated Objects: The child marks the two objects that
belong together in each row of pictures, as, for example, the
hat and the coat. (Figure 4-1, row 3.)

4. Discrimination of Size: The pupil is instructed to mark the
items of clothing which are of the right size for the individual
pictured. For each article of clothing (shoes, hat, gloves, etc.)
one is too large, one is too small, and one is of the right size.

5. Picture Parts: In this test a series of pictures of increasing
complexity is shown. These contain children, toys, animals, and
other items. The same items are shown outside the “standard”
picture, mixed in with other objects. The child is instructed to
mark all of the objects in this group which appeared in the
picture.

6. Picture Completion: In each incomplete picture, the pupil
is asked to locate and mark the correct missing part from among
several parts shown.

7. Dot Drawing: The child is to copy drawings which are
formed by joining dots. See Figure 4-1, row 4.

All the tests are non-verbal, since most of the children for
whom the test is intended have not learned to read. Directions
are given orally.

* Published by the World Book Company, Yonkers-on-Hudson, N.Y.


FIGURE 4-1 Illustrative Items from the Pintner-Cunningham
Primary Test

Test 1. Mark the things that Mother uses when she sews her apron.

Test 2. Mark the prettiest house.

Test 3. Mark the two things that belong together.

Test 7. Look at each picture. See how it is drawn. Make another one
like it in the dots.

Reproduced by permission of the World Book Company.


Scope. The Pintner-Cunningham Primary Test covers the 
kindergarten, Grade I, and the first half of Grade II. There are 
three equivalent forms, A, B, and C. 


Scoring. Scores from the seven sub-tests are combined to give
a total point score. Mental ages corresponding to point scores
may be read from tables in the Manual. Pintner-Cunningham
MA's are chronological ages for which the given point scores
are typical (see page 33). These MA's are divided by the child's
CA to obtain an IQ. An alternate, and better, procedure is to
convert the point scores into deviation IQ's, following the
method used in the Wechsler-Bellevue. The mean IQ is, of
course, 100 and the SD is 16, equal to that of the Stanford-Binet.
Pintner-Cunningham IQ's are not strictly equivalent to
Stanford-Binet IQ's, though . . . measured by Stanford-Binet. The
reliability or stability of Pintner-Cunningham scores is high, as
shown by the close correspondence of one form with another.
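
The difference between the two scoring procedures may be easier to see in a small sketch. The Python below is an illustration under stated assumptions, not the publisher's procedure: the function names are hypothetical, the age-group mean and SD of point scores are invented for the example, and the deviation IQ is treated as a simple linear rescaling to a mean of 100 and an SD of 16.

```python
def ratio_iq(ma_months: float, ca_months: float) -> float:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * ma_months / ca_months


def deviation_iq(score: float, group_mean: float, group_sd: float,
                 iq_mean: float = 100, iq_sd: float = 16) -> float:
    """Deviation IQ: express a point score relative to the child's own
    age group, on a scale with mean 100 and SD 16 (assumed linear)."""
    z = (score - group_mean) / group_sd   # distance from the mean in SD units
    return iq_mean + iq_sd * z


# Hypothetical child: MA 6-0 (72 months), CA 5-0 (60 months).
print(ratio_iq(72, 60))
# Hypothetical age group: mean point score 30, SD 5; child scores 35.
print(deviation_iq(35, 30, 5))
```

The ratio IQ depends on the table of mental ages, whereas the deviation IQ depends only on where the child stands within his own age group.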


2. The California Test of Mental Maturity (CTMM)*

Description. These tests contain both verbal and non-verbal
materials. Sub-tests are grouped under the following five heads:
memory, spatial relations, logical reasoning, numerical reasoning,
and verbal concepts. Each of these categories is represented by
from two to four tests. The profile in Figure 4-2 gives the names
and classification of these sub-tests. The first three tests in each
California battery are designed to measure visual acuity, auditory
acuity, and motor co-ordination. These tests, which are in a
separate booklet, are rough screening devices intended to
identify children too handicapped to be correctly classified by
the test battery.

* Published by the California Test Bureau, Los Angeles, Calif.


FIGURE 4-2 Profile for the California Test of Mental Maturity


Scope. The California test series covers the range from
kindergarten to college. The five batteries are as follows:

1. Pre-primary level: kindergarten and first grade 

2. Primary level: grades 1-3 

3. Elementary level: grades 4-8 

4. Junior high level: grades 7-9 

5. Advanced level: grades 10-college and adult. 
These test batteries require about 1½ hours working time. They
are relatively easy to administer and to score. 


Scoring. Separate scores are obtained for each of the five areas 
(called “factors”) into which the sub-tests have been grouped. 
There are also scores (and mental ages) based on (1) the lan- 
guage or verbal tests alone, (2) the non-language tests alone, and 
(3) the test as a whole. From these three scores, separate MA’s 
may be read from tables in the Manual. Language, non-language, 
and total-test IQ's are found by dividing the appropriate MA 
by the child’s CA. Percentile ranks are also provided for each of 
the five “mental factors.” These PR's may also be read from 
appropriate tables. 

A special feature of the CTMM is the use of a profile or chart 
as an aid in analysis and diagnosis. As shown in Figure 4-2, the
highs and lows of a pupil’s performance in the five areas may be 
readily seen from their positions on the profile. Along the right- 
hand margin of the chart, percentile ranks (PR’s) are entered 
for each factor, as well as for total score and for the language 
and non-language parts of the test. These PR’s give the student’s 
standing on a scale of one hundred points (page 33). If the PR 
is 50, the child stands just in the middle of his age group; if his 
PR is 70, then 70 per cent of his age group fall below him in 
the given test. 
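
The meaning of a percentile rank can be sketched in a few lines. The Python below is a hedged illustration, not the CTMM's own table-lookup procedure: the function name and the ten scores are hypothetical, and the PR is taken simply as the per cent of the age group falling below the given score, as described above.

```python
def percentile_rank(score, group_scores):
    """Per cent of the age group whose scores fall below the given score."""
    below = sum(1 for s in group_scores if s < score)
    return 100 * below / len(group_scores)


# Hypothetical scores for an age group of ten children on one "factor."
age_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36]
print(percentile_rank(30, age_group))
```

In practice the PR is read from the tables in the Manual, which rest on a far larger standardization group than the ten scores used here.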
The validity of the CTMM was determined through its
correlations with Stanford-Binet and other standard mental tests.
The tests appear to be very homogeneous (to measure the same 
abilities) over the age range from pre-school to college. The 
reliability of the language, non-language, and total scores is high: 
reliability coefficients of these part scores and of the factor scores 
range from .87 to .95 over grades 4-6. 


3. Otis Quick-Scoring Mental Ability Tests* 


Description. These tests differ from many group tests of in- 
telligence in that the test items are not grouped into separate 
sub-tests according to type of item. Instead, the different items
(analogies, arithmetic problems, opposites, and the like) are printed
in a continuous repetitive pattern, so that items of a certain sort
(opposites, for example) follow each other at stated intervals.
This arrangement is sometimes called a "scrambled" test, or more
precisely a spiral omnibus arrangement. Items are progressively
more difficult from the start to the finish of the test. 
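
The spiral omnibus idea can be sketched as a simple interleaving. The Python below is an illustration only and does not reproduce the Otis item order: the function name and the item labels are hypothetical, and each list is assumed to be already arranged from easiest to hardest.

```python
from itertools import zip_longest


def spiral_omnibus(items_by_type):
    """Interleave item types in a repeating cycle.

    `items_by_type` maps an item type to its items, easiest first.
    Taking one item of each type per round makes every type recur at
    a fixed interval while difficulty rises from round to round.
    """
    rounds = zip_longest(*items_by_type.values())
    return [item for rnd in rounds for item in rnd if item is not None]


# Hypothetical item labels: A = analogy, R = arithmetic, O = opposites.
booklet = spiral_omnibus({
    "analogies": ["A1", "A2", "A3"],
    "arithmetic": ["R1", "R2", "R3"],
    "opposites": ["O1", "O2", "O3"],
})
print(booklet)
```

With three item types, an item of a given type appears at every third position, which is the "stated interval" referred to above.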

The following items are like those in the Otis Beta Test,** an
examination prepared for grades 4-9.

1. Which of the five things below is soft?
1. glass 2. stone 3. cotton 4. iron 5. ice

2. A robin is a kind of
1. plant 2. bird 3. worm 4. fish 5. flower

3. Hat is to head as shoe is to
1. arm 2. leg 3. foot 4. toe 5. glove

4. North . . .
1. hot 2. east 3. west 4. down 5. south

5. At five cents each, how many pencils can be
bought for 40 cents?
1. 45 2. 8 3. 200 4. 5 5. 12

* Published by the World Book Company, Yonkers, N. Y.

** The first two items are samples from the Beta Test. Other items are like 
those found in the test. 




Scope. The Otis Quick-Scoring Tests cover the age range from 
Grade I through college. There are three batteries, as follows: 
Alpha Test (90 items) non-verbal; grades 1-4 
Beta Test (80 items) verbal, numerical, and spatial; grades 4-9 
Gamma Test (80 items) verbal primarily; high school and 
college 


Scoring. The Otis tests are easy to administer, and scoring is
facilitated by a cutout stencil which can be superimposed on the
test booklet. The tests are virtually "self-administering." There
is a single time limit, which varies from twenty to thirty minutes.
Mental age equivalents to total score are read from tables in the
Manual. The Otis IQ's are deviation scores and are measures
of brightness. These IQ's are only generally comparable to
Stanford-Binet IQ's; the two “scores” are not equivalent. The
reliability of the Otis tests is high.


4. Kuhlmann-Anderson Intelligence Tests* 


Description. This is a series of thirty-nine separate sub-tests
grouped into nine overlapping test batteries. The sub-tests include
verbal and non-verbal materials. The early levels are entirely
pictorial, but the tests become more verbal and abstract as we
go up the age scale and finally are entirely verbal. Each test
battery consists of ten sub-tests.


Scope. Each of the nine batteries is printed in a separate booklet
and is designed to cover one or more grade levels, as follows:

Kindergarten   sub-tests 1-10
Grade 1        sub-tests 4-13
Grade 2        sub-tests 8-17
Grade 3        sub-tests 12-21
Grade 4        sub-tests 15-24
Grade 5        sub-tests 19-28
Grade 6        sub-tests 22-31
Grade 7-8      sub-tests 25-34
Grade 9-12     sub-tests 30-39

Administration of the K-A tests is somewhat more difficult than
with the Otis, since the tests in the batteries are often separately
timed. K-A requires from 30 to 45 minutes to administer.

* Published by the Personnel Press, Inc., 180 Nassau Street, Princeton, N.J.


Scoring. In setting up a scoring plan, the authors of K-A have 
employed what is called the “median mental age” method. This 
may be described briefly as follows. Each of the ten sub-tests 
in a battery yields a mental age. These MA’s (see page 32) are 
chronological ages for which a given score is typical or average. 
Thus if the children who are 10 years and 2 months old earn 
in general a score of 21 on a given sub-test, then the score of 21 
corresponds to or is equivalent to a MA of 10-2 on this
sub-test. MA's are read from tables in the Manual. The median MA
is the median of the ten sub-test MA’s.* This is taken to be the
most representative measure of a child’s over-all ability. 

An IQ for the battery is found by dividing the median MA 
by the child’s life age, or CA. This IQ is not equivalent to the 
Stanford-Binet IQ, though it is related to it. The K-A tests 
measure verbal or abstract intelligence primarily, especially at 
the upper age levels. The reliability of the K-A, as shown by
the stability of its test scores, is very high. Reliability coefficients
have been computed for grades 1 through 9 separately: these range
from .89 to .97.
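
The "median mental age" method can be followed step by step in a short sketch. The Python below is an illustration under stated assumptions rather than the K-A Manual's procedure: the function name and the ten sub-test MA's are hypothetical, ages are expressed in months, and the IQ is taken as 100 times the median MA divided by the CA.

```python
from statistics import median


def median_ma_iq(subtest_mas_months, ca_months):
    """Return (median MA in months, IQ) for one battery.

    With ten sub-test MA's, the median falls midway between the fifth
    and sixth values when they are arranged in order of size.
    """
    med = median(subtest_mas_months)
    return med, round(100 * med / ca_months)


# Hypothetical MA's (in months) from the ten sub-tests of one battery.
mas = [110, 112, 115, 118, 120, 122, 124, 126, 130, 134]
# Hypothetical child aged 9-2 (110 months).
print(median_ma_iq(mas, 110))
```

Because the median is used, one unusually high or low sub-test MA has little effect on the final IQ.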


5. Terman-McNemar Test of Mental Ability** 


Description. This test is designed for high school students. It is 
a measure largely of ability to read and comprehend fairly
difficult prose. Two numerical sub-tests found in an earlier edition
of the test were eliminated in order to render the test more 
unified in content. As it stands, we have a highly verbal battery. 


* When ten scores are arranged in order of size, the point (or score) found
by counting off five scores from either end of the series is the typical value or
median (see page 20).
** World Book Company, Yonkers-on-the-Hudson, N. Y.
There are seven sub-tests, described as follows: information,
synonyms, logical selection, classification, analogies, opposites, 
and best answer. Sample items and instructions for each item 
type are shown in Figure 4-3. These items are easier than are 
the items found in the test proper and are for illustration. Items 
in the test are graded in difficulty from easy to hard. 


FIGURE 4-3 Sample Items from the Terman-McNemar Test
of Mental Ability (Form C)


TEST 1. INFORMATION

Mark the answer space which has the same number as the word that makes the sentence TRUE.
SAMPLE. Our first President was
1 Adams 2 Washington 3 Lincoln 4 Jefferson 5 Monroe


TEST 2. SYNONYMS

Mark the answer space which has the same number as the word which has the SAME or most nearly
the same meaning as the beginning word of each line.


TEST 3. LOGICAL SELECTION

Mark the answer space which has the same number as the word which tells what the thing ALWAYS
has or ALWAYS involves.


TEST 4. CLASSIFICATION

In each line below, four of the words belong together. Pick out the ONE WORD which does not
belong with the others, and mark the answer space bearing its number.


TEST 5. ANALOGIES

Study the samples carefully.
DO THEM ALL LIKE THE SAMPLES.


TEST 6. OPPOSITES

Mark the answer space which has the same number as the word which is OPPOSITE, or most nearly
opposite, in meaning to the beginning word of each line.


TEST 7. BEST ANSWER

Read each statement and mark the answer space which has the same number as the answer which
you think is BEST.
SAMPLE. We should not put a burning match in the wastebasket because
1 Matches cost money. 2 We might need a match later.
3 It might go out. 4 It might start a fire.


Reproduced by permission of the World Book Company.
Scope. The Terman-McNemar is planned specifically for
grades 7 through 12 and for college freshmen. 


Scoring. Total raw score is converted into a scaled score IQ, 
which is closely related to Stanford-Binet IQ. Scores may also 
be expressed as MA’s and as percentile ranks. The working time 
for the test is about 50 minutes. Terman-McNemar comes in 
two equivalent forms. In the construction of the test, a careful
item analysis was made (page 214) in order to weed out unsatis- 
factory items. This is offered as evidence of the test’s validity. 
The reliability of the Terman-McNemar is reported to be .96 
for a single age level. 


6. American Council on Education 
Psychological Examination* 

Description. This battery of tests is designed to measure scho- 
lastic aptitude, or learning ability in school. It comes in two 
forms, one for high-school students and another for college 
freshmen. The college test consists of six sub-tests, as follows: 


1. Arithmetic problems: 20 problems in multiple-choice form, 
of the “mental arithmetic” variety. 

2. Completion: 30 items in multiple-choice form. The test 
demands word knowledge and definitions. 

3. Figure Analogies: 30 multiple-choice items. Analogies in- 
volve geometric forms, areas, angles, spatial arrangements. 

4. Same-Opposite: 50 multiple-choice items which demand 
vocabulary and word knowledge. 

5. Number Series: 30 items to be completed "logically" with 
appropriate numbers. 

6. Verbal Analogies: 40 items: relation-finding in verbal terms. 


In the ACE, college form, sub-tests 1, 3, and 5 are combined to 
give a quantitative, or Q, score; sub-tests 2, 4 and 6 are combined 
to give a linguistic, or L, score. Each sub-test is separately timed 
and each is preceded by a practice exercise. In the high-school 
form of the ACE, tests 3 and 6 are dropped, leaving four sub-tests. 
Completion and same-opposite are combined to give the L 
score, and arithmetic and number series to give the Q score. 


* Published by the Educational Testing Service, Princeton, N. J. 
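
The grouping of the college-form sub-tests into Q and L scores can be sketched in a few lines of Python. This is an illustration only: it assumes that part scores are combined by simple addition, which the text does not state, and the function name and the sample raw scores are hypothetical.

```python
# Sub-test groupings for the ACE college form, as listed in the text.
Q_SUBTESTS = ("arithmetic", "figure_analogies", "number_series")
L_SUBTESTS = ("completion", "same_opposite", "verbal_analogies")


def ace_scores(raw):
    """Return the Q, L, and total scores from a dict of sub-test raw scores.

    Assumption: part scores are combined by simple addition.
    """
    q_score = sum(raw[name] for name in Q_SUBTESTS)
    l_score = sum(raw[name] for name in L_SUBTESTS)
    return {"Q": q_score, "L": l_score, "total": q_score + l_score}


# Hypothetical raw scores for one college freshman.
student = {
    "arithmetic": 12,
    "figure_analogies": 20,
    "number_series": 18,
    "completion": 22,
    "same_opposite": 35,
    "verbal_analogies": 28,
}
print(ace_scores(student))
```

For the high-school form the same function would apply with figure analogies and verbal analogies removed from the two groupings.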


Scope. The ACE is the most difficult of the general intelligence 
tests described so far. Testing time varies from about forty min- 
utes (high school form) to sixty minutes (college form). 


Scoring. The three scores from the ACE—the quantitative, the 
linguistic, and the total—may be converted into percentile ranks. 
Extensive norms (in PR’s) are published annually covering test 
results from previous years. Q scores have been found to correlate 
with achievement (grades) in mathematics and science, 
but the L score has the higher correlation with general achieve- 
ment in high school. 

The predictive validity (page 31) of the ACE, as determined 
over several years, is high. ACE correlates from .40 to .60 with 
college grades; its correlations with Stanford-Binet average about 
.65. The reliability coefficients of Q, L, and total score are all 
very high. One feature of the ACE is the publication of norms 
for different groups. Separate norms are available for boys and 
girls and for three types of college—4-year, 2-year (junior), 
and teachers’ colleges. Although the 4-year colleges achieve 
higher mean scores, there is much overlapping of 4-year, 2-year, 
and teachers’ college scores. Differential norms are a distinct aid 
to educational counselors. 


HOW GROUP INTELLIGENCE TESTS ARE USED 
IN THE SCHOOL 


Survey Measures 


In general, the group test of intelligence is used (1) to give 
an over-all measure of a child's abstract ability (often an IQ), 
(2) to provide a basis for educational counseling and guidance, 
and (3) to give a basis for prognosis. The total score on a group 
test is useful to the school administrator, the classroom teacher, 
and the parent. Standard tests supply the school administrator 
with a systematic record of how different schools, and classes 
within a given school, compare in general ability to learn. The 
classroom teacher gets needed information concerning the abilities 
of individual pupils. Within a given class, the spread of 
ability is often disturbingly wide. A teacher can tell from his 
test scores whether John and Mary are doing the caliber of 
work which can reasonably be expected of them, and whether he 
is pitching his instruction at the comprehension level of the class 
as a whole. Parents can plan the future education of their children 
more intelligently when they know the level of performance 
to be expected of them. And students can set their academic 
and occupational goals more realistically when they are aware 
of their strengths and weaknesses as shown by comparison of 
their scores with norms for their age level. 


Counseling, Guidance, and Prognosis 


The total score from a group test—the IQ or other type score 
—is most useful as a measure of a pupil’s over-all academic ability. 
For guidance and counseling the teacher can use to greater advantage 
the sub-tests or part scores from the test battery. The 
profile of the California Test of Mental Maturity, for instance, 
has been especially designed for diagnosis. From the language 
and non-language IQ’s, a teacher can judge whether a child is 
predominantly “verbal-minded” or “object-minded”; and from 
the five “factor” scores on the profile he can judge how proficient 
a pupil is in memory, logical reasoning, verbal concepts, 
spatial-perceptual relations, and numerical reasoning. In Figure 
4-2, for example, low scores in reasoning and vocabulary indicate 
poor academic ability—that is, the pupil lacks the ability 
to solve problems efficiently by means of symbols (numbers 
and words). High scores in these factors reveal good academic 
aptitude and, when combined with other traits, suggest that the 
pupil is capable of more advanced, and perhaps professional, 
training. Low scores in spatial relations reveal little promise of 
success in geometry, mechanical drawing, and perhaps manual 
training. High scores here, plus high scores in the other factors, 
forecast aptitude for engineering and architecture. The memory 
factor is based on too meager a sample to provide a reliable 
measure of a pupil's functional memory; the score here might 
be significant, however, if very high or very low. 

Part scores, like those from the CTMM, are helpful in giving 
the classroom teacher clues as to a child's abilities. But these 
scores must always be interpreted with caution (page 68). The 
sub-tests upon which such judgments are based are quite short 
and are often too narrow to permit of a broad prediction. Marked 
differences in part scores should always be substantiated by 
further investigation; they should jibe with other tests, with 
grades, and with the teacher’s judgment from observation of the 
pupil’s classroom work. 

The Otis Quick-Scoring Mental Ability Tests and the Kuhlmann-Anderson 
Intelligence Tests are primarily useful as over-all 
measures of the general level of mental functioning. The sub-tests 
of the Kuhlmann-Anderson are fairly complex. The authors, 
very wisely, do not recommend that specific scores (mental 
ages) from sub-tests be interpreted as measuring definite psychological 
functions. Wide variations in score from sub-test to sub-test 
for a given child may, however, be indicative of gaps in 
training or in native ability. 

The Pintner-Cunningham Primary Test is most useful, perhaps, 
in helping the teacher and parent decide whether a child 
is mature enough mentally to do first-grade work. Entrance into 
first grade should not depend solely upon the MA or CA, however. 
Children who are babyish in their social behavior and 
poorly developed physically are poor prospects for first grade, 
no matter how high their IQ’s. 

Because of its high verbal content, the Terman-McNemar 
Test of Mental Ability is one of the best predictors of high- 
school achievement. The homogeneity of the sub-tests (their 
high degree of relatedness) renders the test less useful for diagnosis 
of a student’s strengths and weaknesses. The American 
Council on Education Psychological Examination is a good predictor 
of college work. This battery measures initiative in attacking 
new problems, and mental speed and facility and good work 
habits, as well as abstract ability. ACE is also useful in guidance, 
since it provides three scores—a quantitative (Q), a linguistic 
(L), and a total. The ACE for high-school students is used as 
a screening test for prospective college freshmen and as a basis 
for counselling high-school students who plan to continue their 
education beyond high school. The L score is perhaps most 
predictive of general college work, because of the great impor- 
tance of reading comprehension in college courses. The Q score 
has predictive value for science and mathematics, especially when 
confirmed by other indicators (grades, teachers' judgment). 
The ACE (page 91) provides separate norms for 2-year, 
4-year and teachers’ colleges. The 4-year colleges have the 
higher average scores, but variation in score from one type 
of college to another is very large, as is also variation in score 
within each college type. A student’s chances of entering college 
and staying there will depend to a considerable degree upon the 
college he chooses (see page 115 for discussion of local and 
nation-wide norms) or to which he is admitted. Only superior 
students should be encouraged to apply to high-standard col- 
leges, and not all of these are good risks unless they have the 
personal qualities to go along with academic potential. Good 
personality and a capacity for hard work may not, in themselves, 
enhance a student's chances of being accepted into an 
A-grade college, but they will help him stay there once he is in. 
Students with relatively low scores on ACE may be quite suc- 
cessful in colleges in which the scholastic standards are not too 
high. In any event, knowledge of his academic strengths and 
weaknesses should be helpful to a student, whether he plans 
further school work or not. 
Norms for the ACE for high-school students are based upon 
selected groups and may be much too high for all high-school 
seniors. In fact, his PR on the ACE may be unfair (misleadingly 
low) for the high-school senior of modest intellectual endowment 
who does not plan to enter college. Such a youngster may 
rank well up among 18-year-olds in the population but relatively 
low among those of college caliber. 


Limitations of the Group Intelligence Test 


Intelligence tests have definite limitations, and teachers and 
parents must not expect the impossible from them. For one 
thing, a group intelligence test cannot increase intelligence, as 
parents sometimes seem to think it should. Again, a group test 
IQ is not necessarily a good measure of a pupil's drive to accom- 
plish, or of that dogged determination to stick to an unpleasant 
task and see it through. Nor is a fairly high IQ (even a high IQ) 
always accompanied by emotional stability, good judgment, and 
initiative. All these traits are related to good intellect, but the 
relationship is by no means perfect. Many persons of average 
intellectual ability succeed in college, whereas many of greater 
potential fall by the wayside. Intelligence is a necessary, but 
is not a sufficient, attribute for high accomplishment in school 
or in life. 


WHAT TO LOOK FOR IN A GROUP 
INTELLIGENCE TEST 


The adequacy of a group intelligence test is judged by its 
validity, reliability, scoring methods, and norms. The object in 
giving the test, its cost, and such factors as time and personnel 
must also be considered. 

Validity. A test is valid, as we have noted, if it measures what it 
purports to measure (page 30). Group intelligence tests have 
been validated, in general, against various criteria judged to be 
indicative of intellect (page 31). Some of these criteria are 
school grades, ratings for ability, and other intelligence tests. 
All such criteria are admittedly indirect and fallible; at the 
same time, they represent measures with which any authentic 
test of intelligence must correlate. Perhaps the best criterion of 
the validity of a group test is its success in predicting perform- 
ance in tasks judged to require intelligence—in school, in busi- 
ness, in the armed forces, or in a profession. Judged by correla- 
tional criteria and predictive power, most of the widely used 
group intelligence tests may be accepted as valid, though never 
perfectly so. 


Reliability. We have already had occasion to use the term 
reliability with reference to individual intelligence tests. If a 
child earns an IQ of 108 on one form of a group test and three 
months later achieves an IQ of 106, or 108, or even 110 on a 
second form—that is, scores within a few points of the first 
determination, and if most persons examined show similarly 
consistent results, we regard the test as reliable. Reliability de- 
pends essentially upon the stability or consistency of a score. 
When properly given and scored, most standard group intelligence 
tests are highly reliable. 
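
The consistency described here is commonly summarized by the correlation between scores on the two forms. The sketch below computes a product-moment correlation for a small group; the IQ's are hypothetical, and the use of this particular coefficient is an assumption for illustration, since the text does not say how any given test's reliability was computed.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical IQ's for six pupils on Form A, and three months later on Form B.
form_a = [108, 95, 121, 88, 102, 115]
form_b = [106, 97, 119, 90, 104, 113]

r = pearson_r(form_a, form_b)
print(round(r, 2))
```

A coefficient near 1.00 means that pupils keep nearly the same relative standing from one form to the other, which is what is meant by a reliable test.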


Scoring. Group intelligence tests are first scored in arbitrary 
points, one or more points being assigned to each correct answer. 
Point or raw scores are frequently converted into MA’s 
and IQ's. Such IQ’s are related to, but are not equivalent to, 
Stanford-Binet IQ's. Group test IQ’s are adequate for screening 
and often are satisfactory for guidance; but the individual intelli- 
gence test IQ is a more searching and more nearly constant 
measure of a child's talents (page 61). In addition to MA’s 
and IQ's, many group tests also provide PR's for raw or obtained 
scores. These PR's are readily interpreted: they show how high 
the pupil ranks on a scale of one hundred points. If a high-school 
senior has a PR of 85 in the (L) part of the ACE and a PR of 
80 on the (Q) part, he should be a good risk for college work. 
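
The meaning of a PR can be made concrete with a short computation. The sketch below is illustrative only: the norm group is hypothetical, and counting half of the tied scores as falling below the pupil is one common convention, assumed here and not taken from any test manual.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`.

    Assumption: half of the scores tied with `score` are counted as below it.
    """
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


# Hypothetical raw scores earned by a norm group of ten pupils.
norm_group = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]

print(percentile_rank(75, norm_group))  # prints 75.0
```

Published norms rest on far larger groups, but the arithmetic is the same: the PR locates a pupil within the group with which he is compared.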

A second way of rendering the scores from different sub-tests 
in a battery comparable is through the use of standard scores. 
Point scores may be converted into a standard score scale with 
a convenient mean and σ. The sub-tests of a group intelligence 
test usually differ in content, in length, and in difficulty. These 
part-scores cannot be compared—or combined—as they stand. 
But when converted into a common scale, they can be added to 
give a total in which each sub-test has the same weight. 
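
A brief sketch may help to show how two sub-tests of unequal length are put on a common scale. The choice of a mean of 50 and a σ of 10 is an assumption made for illustration (each test fixes its own scale), and the raw scores are hypothetical.

```python
from statistics import mean, pstdev


def standard_scores(raw, new_mean=50.0, new_sd=10.0):
    """Convert raw scores to a scale with the chosen mean and sigma."""
    m = mean(raw)
    sd = pstdev(raw)  # sigma of the group, treated as the whole norm group
    return [new_mean + new_sd * (x - m) / sd for x in raw]


# Hypothetical raw scores of five pupils on a long and a short sub-test.
vocabulary = [10, 20, 30, 40, 50]
arithmetic = [12, 4, 8, 6, 10]

vocabulary_ss = standard_scores(vocabulary)
arithmetic_ss = standard_scores(arithmetic)

# On the common scale the two sub-tests carry equal weight in the total.
totals = [v + a for v, a in zip(vocabulary_ss, arithmetic_ss)]
print([round(t, 1) for t in totals])
```

Had the raw scores been added as they stand, the vocabulary test, with its much larger spread, would have dominated the total.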


Norms. Norms (page 40) are typical measures of achieve- 
ment. Norms may be nation-wide or local (page 115). Local 
norms are often fairer for a given group in that they take into 
account the conditions within a given city or state. National 
norms are most useful for wide comparisons and as standards at 
which to aim. Norms for college freshmen will generally be 
much too high for high-school graduates in general. College 
freshmen are a selection of high-school graduates according to 
academic proficiency. Group intelligence-test norms are usually 
given in terms of age level, but they may be in terms of grade 
level. 


FIGURE 4-4 Norms for Various Occupational Groups on the 
Army General Classification Test 

[Chart not reproduced. For each civilian occupation a bar on the AGCT 
standard-score scale (60 to 150) marks the 10th, 25th, 75th, and 90th 
percentiles. Occupations, in the order shown: Accountant; Medical 
student; Teacher; Lawyer; Bookkeeper, general; Stenographer; Reporter; 
Clerk, general; Purchasing agent; Salesman; Telephone repairman; Artist; 
Toolmaker; Printer; Machinist; Policeman; Sales clerk; Electrician; 
Machinist's helper; Welder, combination; Plumber; Carpenter, general; 
Automobile repairman; Tractor driver; Painter, general; Truck driver, 
heavy; Cook; Laborer; Barber; Miner; Farm worker; Men in general.] 

Reproduced by permission of Harper & Brothers. 

Figure 4-4 shows the norms for certain occupations on the 
Army General Classification Test, used in the armed forces in 
World War II. The higher scores are achieved by men with 
the most extensive training, and are probably the resultant of 
both intelligence and training. The more intelligent men are able 
to undertake the more exacting training, and this training enables 
their native talent to express itself. It is interesting to note the 
large degree of overlapping in score from one occupation to 
another. It seems evident that many men are functioning at a 
level below their native capacity. 

The educational expectation of a child whose group test IQ 
is 90, 100, or 115 may be read with sufficient accuracy for most 
purposes from Table 3-3. 


Other Factors Which May Govern the Choice of a Group Intelli- 
gence Test. In addition to the formal requirements to be met by a 
group test discussed above, there are other considerations which 
enter into the suitability of a test for a given school system. 
Among the more important are time available for testing, per- 
sonnel, cost, and acceptability. Catalogues provide data on cost 
and time allowances—most testing periods are set to fit com- 
fortably into a class period. In most cases, teachers can administer 
group tests with a minimum of instruction, and scoring can be 
done with stencils. Acceptability of a test depends on whether 
the teachers and the community look with favor upon standard 
tests. Much of the disfavor with which parents once regarded 
mental tests has fortunately disappeared, though one still encoun- 
ters skepticism as to their value. In initiating a testing program 
it is always wise to avoid tests which contain what appear to be 
trick items and those which resemble puzzles. Such tests are 
likely to be labeled frivolous by teachers and parents. Some 
parents still think that the object of a mental test is to describe 
their children as dull or mentally abnormal. When they see the 
value of a standard test in providing a better understanding of a 
child’s capabilities, their objections disappear. From the catalogues 
listed on page 253, the teacher or administrator should be 
able to find the test suitable for a given situation. 


SUGGESTIONS FOR FURTHER READING 


Cronbach, L. J. Essentials of Psychological Testing. New York: 
Harper, 1949. 


Freeman, F. S. Theory and Practice of Psychological Testing (Rev. 
edition). New York: Holt, 1955. 

Goodenough, F. L. Mental Testing. New York: Rinehart, 1949. 

Noll, V. H. Introduction to Educational Measurement. Boston: Hough- 
ton Mifflin, 1957. 


Thorndike, R. L. and Hagen, E. Measurement and Evaluation in 
Psychology and Education. New York: John Wiley, 1955. 


SUGGESTIONS FOR LABORATORY WORK 


1. Administer three or four standard group tests of intelligence to the 
class and have the students score their own papers. If the test is for 
young children, cut the time limits in half. 

2. Select one of the tests taken in (1). Examine the Manual for the 
author's treatment of validity, reliability, scoring methods, and norms. 
Summarize these data. 

3. In another of the tests from (1), count the number of items which, in 
your opinion, are verbal, numerical, and spatial-perceptual. In which 
group did you do best? Worst? Does your result jibe with what you ... 


QUESTIONS FOR DISCUSSION 


... estimate of a pupil’s ability than is the rating ... 

3. Why do we get different IQ's for the same pupil from different 
intelligence tests? 

5. Suppose you are a sixth-grade teacher. You have administered a 
standard group-intelligence test to your class. What uses do you think 
you might make from a knowledge of these children's IQ's? 

6. A pupil has taken the CTMM (page 84). In counseling this child, 
what help might you get from a wide difference in his language and non- 
language IQ's? 




CHAPTER 5 


EDUCATIONAL ACHIEVEMENT TESTS 
like. The educational achievement test is also concerned with 
mental processes, but only insofar as they are demonstrated in 
a student’s performance in English composition, arithmetic, history, 
or science. 

The distinction between the two sorts of test is not always 
clean-cut, and there is much overlap in content and in abilities 
called upon. All intelligence tests depend in some degree on 
previous learning, and all educational tests depend in some part on 
native keenness. Educational achievement tests predict future 
school performance as well as or better than intelligence tests. 
Achievement in the elementary school, for example, forecasts 
achievement in high school; and performance in arithmetic predicts 
later performance in algebra. But prediction is strengthened 
when an intelligence test is added to the achievement battery. 
Perhaps the general intelligence test is most useful when we want 
an estimate of potential aptitude, the achievement test when we 
want a measure of present school standing and probable success 
in later school work. Both tests provide valuable information, and 
each supplements the other. 

Educational achievement tests are useful (1) for survey pur- 
poses—that is, to determine a class’s standing in relation to some 
norm, and (2) for guidance and evaluation—that is, to provide a 
clearer understanding of what individual pupils have learned—or 
failed to learn—in specific school subjects. A better understanding 
of strengths and weaknesses is a major objective of a testing pro- 
gram. Remedial work can be undertaken more intelligently and 
teaching improved when we know what errors a pupil is making 
consistently and what misconceptions and gaps in training led to 
these errors. 

Achievement tests are often used for sectioning pupils in order 
to improve working conditions within the classroom. Thus pupils 
may be classified into high, average, and low ability groups on the 
basis of over-all educational standing, or sectioned within a 
grade into fast, medium, and slow learners. Prediction of later 
school success on the basis of educational achievement tests is 
considerably more accurate than are forecasts based on conventional 
school marks. 


THE SUPERIORITY OF STANDARD ACHIEVEMENT 
TESTS OVER ROUTINE EXAMINATIONS 


Standard Achievement tests are superior to teacher-made tests 
in three principal respects. 


1. The Achievement Test Is Better Planned. The usual teacher-made 
test in algebra or French is composed of questions and 
problems covering topics which one teacher believes worth know- 
ing about his subject. Usually materials are drawn from a single 
textbook. Such a test is valuable as a measure of progress in learn- 
ing, but it is not very broad in coverage and does not permit 
comparisons with the achievement of students in other schools. 

The standard educational achievement examination, on the 
other hand, is compiled after an analysis of many widely used 
textbooks and various courses of study and sets of examinations. 
Thus it represents a consensus—the pooled judgment of many 
competent teachers and testing specialists. Drawing materials 
from many sources insures a representative sampling of subject 
matter. Occasionally a teacher will complain that a general 
achievement test contains questions about topics or books (in 
English literature, for example) which his class has not studied, 
and that on this account the test is unfair. This is often true, but 
the criticism is not as damaging as it may seem. Few classes have 
covered equally well all of the topics treated in a comprehensive 
achievement test. Some teachers will have emphasized one topic, 
some another, but by and large these inequalities will even up for 
the test as a whole. Rarely will a school have a general and marked 
advantage (or disadvantage) over another school in educational 
experience, unless the teaching, the curriculum, and/or the caliber 
of the students are exceptionally good or poor. When gross 
inequalities are revealed in test scores, the reason for such differences 
should be sought. It seems hardly wise on that account to 
abandon the test. 


2. The Achievement Test Is More Objective. The standard 
achievement test is more objective than the teacher-made ex- 
amination. This means that in an achievement test, grades received 
by students depend to a minimum degree on the personal opin- 
ions, likes, and dislikes of the scorer. In the traditional essay 
examination, a high degree of subjectivity is almost inevitably 
present: the mark given an answer depends on what one teacher 
regards as important and significant. 


3. The Achievement Test Lays Down More Exact Specifications. 
The educational achievement test is more logically planned than 
the ordinary teacher-made examination, because makers of stand- 
ard tests draw up specifications for an examination. These lists 
are often lengthy and quite specific, but in general they can be 
reduced to two—knowledge and application. Thus test items are 
selected to reveal a pupil’s information and understanding of 
facts, as well as his acquired skills in, for example, reading or 
arithmetic. Again, items are chosen to reveal a pupil's ability to 
apply known principles, to interpret, draw conclusions from 
given data, and solve problems. The second of these specifications 
is the more important, but the first is not to be dismissed lightly 
as being a matter of “mere memory.” Students cannot write 
good English prose, nor can they read difficult passages in history 
and literature, without adequate vocabulary. Even in so “logical” 
a subject as mathematics, a student cannot solve “originals” in 
geometry (no matter how bright he is) unless he knows the 
preceding propositions. Rote memory, of course, is rarely 
enough. The older spelling bees found how many detached and 
isolated words a child can spell—though often he had little idea 
of what the words meant. Modern spelling tests try to discover 
whether a child can spell a word and also knows its meaning well 
enough to use it correctly in a sentence—that is, in context. The 
second method (application as well as knowledge) provides a 
better measure of a child’s usable vocabulary. 


GENERAL EDUCATIONAL ACHIEVEMENT 
BATTERIES 


The present section will describe five representative achieve- 
ment test batteries (chosen from many) which are designed to 
measure general educational achievement in the elementary 
grades and in the high school. 

1. The Stanford Achievement Test (SAT)* 

2. The Metropolitan Achievement Tests (MAT) 

3. The California Achievement Tests (CAT) 

4. The Cooperative General Achievement Tests (GAT) 
5. The Sequential Tests of Educational Progress (STEP) 


All these tests make some provision for the analytic study of a 
student’s strong and weak points through a comparison of sub-test 
scores. Part scores are often represented comparatively on a 
graph or profile. 


The Stanford Achievement Test (SAT)** 


Description. The SAT consists of overlapping sub-tests 
grouped at four ability levels from grade 2 through grade 9. All 
four of the batteries contain three tests of paragraph meaning, 
word meaning and spelling (these are essentially measures of 
language skills); and two tests of arithmetic reasoning and 
arithmetic computation (number or quantitative skills). All these 
tests are multiple-choice in form. In addition to these five sub- 
tests, the Intermediate Battery (for grades 5 and 6) and the Ad- 
vanced Battery (for grades 7, 8, and 9) include four other tests: 
language, social studies, natural science, and study skills. The 
Language Test contains items in capitalization, punctuation, and 
sentence structure. The Social Studies Test covers fundamentals 
of history, geography, and civics. The Study Skills Test is an 
ingenious attempt to discover how well a student reads maps, 
interprets graphs and tables, and uses references. This information 
is important to the teacher, since many pupils regularly skip 
all tables and graphs unless supervised. 


* These batteries are often referred to in abbreviated form by the capital 
letters. 

** Published by the World Book Company, Yonkers, N. Y. 

There are five forms for each battery. The Primary Battery 
is printed in a single booklet of eight pages and takes a little more 
than two hours to administer. Figure 5-1 shows some of the items 
from the Primary Battery. The Elementary Battery (for grades 3 
and 4) contains six sub-tests: paragraph meaning, word meaning, 
spelling, arithmetic reasoning, arithmetic computation, and language. 
The Intermediate Battery requires almost four hours and 
will, of course, have to be spread over several class periods. The 
authors of the tests have drawn up a convenient testing schedule, 
with approximate times for each sub-test. 


FIGURE 5-1 Sample Items from the Stanford Achievement 
Test, Primary Battery, Form K 


Test I. Paragraph Meaning. 
Directions: “Find the one word that belongs in each space, and draw a line 
under the word. Do not write in the spaces.” 

Baby pets me. 
I drink milk. 
I say “Mew, mew.” 

Cow kitten pony child 


Test IV. Arithmetic Reasoning. 
Directions: “Now look at the pictures. Put your finger on the little chair 
in the top box. That is right. Next to the little chair are some candles. 
Put a cross on the shortest candle. Make a mark like this.” (Illustrate 
on the board, making a large X.) 

[Pictures not reproduced.] 

“Do you see the row of clocks? Put a big cross on the clock that says it is noon.” 


Reproduced by permission of the World Book Company. 


Scope. The scope of the SAT is as follows: 


1. Primary Battery: end of grade 1, grade 2, and first half of 
grade 3 

2. Elementary Battery: grades 3 and 4 

3. Intermediate Battery: grades 5 and 6 

4. Advanced Battery: grades 7, 8 and 9 


These four achievement tests cover the fundamentals taught in 
most schools over the elementary grades through grade 9. 


Scoring and Norms. All of the sub-tests are objective in form, 
so that scoring can be readily accomplished by stencils or scor- 
ing keys. Norms are in grade equivalents to raw scores, and also 
in percentiles for sub-test scores. 

There are two types of norms. The first, called the modal-age 
grade norm, is recommended for individual diagnosis, that is, 
for evaluating the scores of individual pupils. From tables in the 
Manual, a pupil's scores can be compared with those earned by 
children who are typical for age and grade. A second norm, the 
total-group grade norm, is based upon the performance of all 
children in a given grade. These norms, given in tables in the 
Manual, are recommended by the authors when one wishes to 
evaluate a class average. Raw scores on the sub-tests are converted 
into standard score units so that they may be combined 
and compared (page 38). 

The validity of the SAT is high. The tests possess content 
validity and the correlations of the batteries with grades and 
other criteria demonstrate excellent predictive validity. The reliability 
of the various batteries is also satisfactory. 


The Metropolitan Achievement Tests (MAT)* 


Description. The MAT includes five test batteries with a range 
from grade 1 through the first half of grade 9. All of the test 
batteries contain sub-tests of reading and arithmetic; spelling is 
added after grade 2, and language usage after grade 3. At the 
intermediate and advanced levels there are ten sub-tests in all: 
reading, vocabulary, arithmetic fundamentals, arithmetic prob- 
lems, English, literature, social studies (history), social studies 
(geography), science, and spelling. In addition to the complete 
batteries, partial test batteries are available for use at the inter- 
mediate and advanced levels. These include the skill subjects— 
reading, arithmetic, English, and spelling—plus vocabulary and 
arithmetic problems. All tests at a given level are printed in a 
single booklet. 

The MAT provides a comprehensive survey of a pupil’s educa- 
tional attainment. Moreover, the profile chart (see Figure 5-2) 
printed on the last page of the test booklet and the class ability 
sheet allow the teacher to identify the student’s weak points, to 
correct errors consistently made, to study a pupil's rate of 
progress from time to time, and to group pupils for instruction 
or review. Tests in arithmetic and reading are available as sep- 
arates and may be used when it is not feasible to administer the 
whole battery. 

Scope. MAT includes the following batteries: 


1. Primary Battery I: grade 1 and beginning grade 2 

2. Primary Battery II: grade 2 and beginning grade 3 

3. Elementary Battery: grades 3 and 4 and beginning grade 5 

4. Intermediate Battery: grade 5 up to the first half of grade 7 

5. Advanced Battery: grade 7 up to the first half of grade 9 

MAT covers a wide range of material taught in grades 1-9. Test 
batteries require from one hour (primary) to about four hours 
(advanced). 


* Published by the World Book Company, Yonkers, N. Y. 


FIGURE 5-2 Individual Profile Chart, Metropolitan Achievement 
Tests: Intermediate Battery, Complete 

[Chart not reproduced. The profile plots sub-test scores against an 
Age Equivalent Scale and a Grade Equivalent Scale.] 

Reproduced by permission of the World Book Company. 


Scoring and Norms. The MAT is easy to administer and to 
score. There are three types of norms: age, grade, and percentile. 
Norms are given also in a standard score scale which is based 
on the assumption of a normal distribution of test ability in the 
sixth grade. Standard or scaled scores are comparable from bat- 
tery to battery in the same subject, but not from test to test within 
a given battery. Figure 5-2 shows the profile of George Fergu- 
son, who is 11 years and 8 months old. George is a sixth-grade 
student, and the MAT was administered on February 6 when 
he was midway through the grade (that is, at 6.5). George's 
scores on the ten sub-tests have been converted into age- 
equivalents from the appropriate tables in the “Key and Direc- 
tions for Scoring.” His subject ages (also called educational ages 
or EA’s) have been entered on the chart and joined by short 
straight lines to give the profile of his school achievement. A 
straight line drawn horizontally across the chart through George's 
chronological age of 11-8 shows immediately in what subjects 
he is above or below the scores typical for his age level. 
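
As a rough illustration of how such a profile is read, the sketch below compares a set of subject ages with a chronological age of 11-8. The subject ages are invented for the example and are not George's actual scores; the function names are likewise hypothetical.

```python
def to_months(age: str) -> int:
    """Convert an age written as 'years-months' (for example '11-8') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def read_profile(chronological_age: str, subject_ages: dict) -> dict:
    """Months by which each subject age lies above (+) or below (-) the CA."""
    ca = to_months(chronological_age)
    return {subject: to_months(ea) - ca for subject, ea in subject_ages.items()}


# Hypothetical subject ages (EA's) for a pupil whose CA is 11-8.
example_ages = {
    "reading": "12-6",
    "arithmetic problems": "10-11",
    "spelling": "11-8",
}
differences = read_profile("11-8", example_ages)

for subject, diff in differences.items():
    status = "above" if diff > 0 else "below" if diff < 0 else "at"
    print(f"{subject}: {status} age level ({diff:+d} months)")
```

The horizontal line drawn through the chronological age on the chart does the same work by eye: every point above the line gives a positive difference, every point below it a negative one.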

George's raw scores were converted into age- instead of 
grade-equivalents. These EA’s show whether George is acceler- 
ated or retarded as compared with children of his own age. EA's 
are useful in guidance. Grade equivalents give the grade levels to 
which various scores correspond. A profile plotted from
grade-equivalents tells us whether a pupil is above or below his present
grade level in his various subjects. Both norms are useful. Grade 
norms are especially useful when comparisons with national or 
local norms are to be made; age norms are most useful when 
diagnosis of a pupil's strengths and weaknesses is wanted. 

Both the validity and the reliability of the MAT are satisfactory
as judged by the usual criteria.


The California Achievement Tests (CAT)*


* Published by the California Test Bureau, Los Angeles, Calif.


Description. The CAT have been organized into five batteries
designed to cover the ability range from grade 1 to college. The
tests are survey in nature and are concerned primarily with skills 
in six areas: reading vocabulary, reading comprehension, arith- 
metic reasoning, arithmetic fundamentals, mechanics of English 
and spelling. The authors of CAT believe that tests in these areas 
are more valuable than are tests in such subjects as social studies, 
where the content varies widely from school to school. The
California Tests emphasize power rather than speed, the time
required for the Elementary Battery being more than two hours.
CAT stresses the use of the separate tests in diagnosis. Except 
in the case of spelling, for example, the tests in the six areas are 
subdivided into sections, each dealing with some important aspect 
of the subject. For example, in the Elementary Battery, reading 
comprehension (Test 2) is analyzed into (1) following direc- 
tions, (2) reference skills, and (3) interpretation of material. 
Test 3, arithmetic reasoning, is broken down into (1) meanings, 
(2) signs and symbols, and (3) problems. Scores from each of 
these sub-divisions are plotted on a profile like that of Figure 5-2, 
usually in grade-equivalent units. The analysis of a pupil’s per- 
formance is carried still further by a second grouping together 
of items which presumably measure essentially common functions.
Thus within the division of punctuation under Test 5,
mechanics of English, items are grouped into those which involve
commas, periods, question marks, quotation marks. Under
the heading of addition in Test 4, arithmetic fundamentals, items
are grouped under zeros, carrying, fractions, and decimals. The
number of item classifications under a given test varies from 50 
to more than 100. A special chart enables the scorer to analyze 
the pupil’s achievement over a wide range of these elements. 
Careful examination of specific item-groups may, to be sure, 
reveal why a pupil fails consistently to use decimals correctly 
or to understand fractions; or it may tell us where he is weak in 
punctuation, or in vocabulary, or in spelling. The CAT at least 
makes an attempt to keep the individual pupil from being lost in 
an “average.” At the same time, it must be remembered that individual
diagnosis based on a few items is always tentative and
may be misleading.


Scope. The CAT consists of the following batteries, which 
cover the educational levels described. 


1. Lower Primary: grades 1 and 2
2. Upper Primary: grades 3 and 4
3. Elementary: grades 4, 5, and 6
4. Junior High: grades 7, 8, and 9
5. Advanced: grades 9 to 14


Scoring. Raw scores on the tests may be converted by tables 
into age, grade, and percentile-within-grade norms. The sub-tests 
are objective in form, easy to administer, and easy to score. The 
six tests of the batteries have satisfactory reliability, but the re- 
liabilities of the various sub-divisions are quite low because of the 
few items included in some groupings (often only one or two). 
Validity is high for the whole test. 


Co-operative General Achievement Tests (GAT)* 


Description. These achievement tests deal with three fields or
areas: Test I covers social studies; Test II, natural sciences; and
Test III, mathematics. Each test battery consists of two divisions:
Part I, which deals with fundamental terms, concepts and definitions;
and Part II, which covers applications of knowledge, interpretation,
and comprehension. The battery has been planned
for grades 10, 11, and 12, but it is probably too difficult for all
but superior tenth and eleventh graders. The battery is objective
in form throughout.


Scope. GAT is a power test designed for the upper school
grades and for college freshmen. Each test requires from 40 to
60 minutes.


Scoring. The tests are all multiple-choice, and are easy to
administer and to score. Items are graphic, pictorial, and verbal.
Norms in scaled scores and percentiles are given for high-school
students and college freshmen. GAT is probably most useful in
the counseling of high-school students as to the subject fields in
which they show the greatest promise.


* Published by the Educational Testing Service, Princeton, N. J.


The Sequential Tests of Educational Progress (STEP)* 


Description. As the term “sequential” implies, this battery is 
designed to measure a student’s progress in learning as he goes 
from the elementary grades to college. The tests deal with 
critical skills in seven academic areas: essay tests, planned to 
provide standardized tests in writing prose; listening compre- 
hension tests, in which the examiner reads a passage and asks ques- 
tions designed to call out comprehension, interpretation, and 
evaluation; reading tests, covering a wide range of content; writ- 
ing tests, planned to measure the student's ability to express ideas; 
mathematics tests, which contain items over a wide range of 
subject matter and difficulty; science tests, dealing with the appli- 
cation of scientific knowledge to a variety of situations; and 
social studies tests, designed to show progress in social and civic 
development. 


Scope. STEP is designed to measure achievement over the
following levels:


Level 1: freshman and sophomore years of college
Level 2: grades 10, 11, and 12
Level 3: grades 7, 8, and 9
Level 4: grades 4, 5, and 6


* Published by the Educational Testing Service, Princeton, N. J.


stencils. A profile chart allows the examiner to analyze a pupil’s
performances on the several functions measured by the battery. 


GENERAL ACHIEVEMENT TESTS 
IN THE SCHOOLS 


We have seen how the general educational achievement test 
gives the academic level of a pupil or of a class, and how the test 
profile reveals strengths and weaknesses in a variety of subjects 
and processes. Further illustration of how educational achievement
tests may be utilized in (1) evaluation, (2) diagnosis, and
(3) prediction will be given in this section. 


Evaluation. Suppose that Miss Clark has given the SAT to her 
sixth-grade class of twenty-six pupils. She finds her class mean 
(average) on the test battery to be about equal to the local norms 
for the sixth grade, but slightly below the national norm as given 
in the Manual. Does this result mean that Miss Clark is doing a 
poor job because local norms are less valuable than national? The 
answer is No, since a number of factors affect achievement in a 
given school system or a single school, and some of these may 
cause local norms to be lower or higher than national. Among 
these factors are the following: 


1. Retardation as a consequence of strict promotional stand- 
ards and practices. Much retardation will lower local norms, 
whereas the weeding out of poor students (by transfer to 
special classes, for example) will raise local norms. 

2. Promotion by age irrespective of achievement. This fairly 
common practice will lead to a progressive lowering of 
local grade norms.

3. Previous experience of pupils with standard objective tests.
This factor varies widely and often affects local norms.

4. Coaching in the tests themselves. Sometimes teachers coach
pupils in materials akin to or identical with those found in
the tests. “Teaching for the tests” is bad practice and should
be discouraged whenever possible. Coached pupils usually
raise the class’s performance.

5. Selection. Children from a poor socio-economic back- 
ground generally score lower on standard tests, whereas 
children from good neighborhoods score higher, especially 
on the verbal tests. 

6. Motivation. Children do not try hard on tests if the teacher’s
attitude is negative, or if the parents think achievement 
tests are worthless—and say so loudly and often. 

7. Transfers, drop-outs. These children may affect local 
norms, usually adversely. 


In some private schools in which pupils are generally of high 
caliber because of stringent selection procedures, local norms will 
often be found to be considerably above national norms based 
on public school results. In a large city system we can expect an 
occasional sixth-grade class to fall below national norms even 
when the city as a whole is up to national standards. But when 
a number of classes fall below national standards, the curriculum, 
the teaching methods, the promotional standards and other con- 
ditions in the school and the community should be examined. 


Diagnosis. In looking over her test results for the sixth grade, 
Miss Clark may find that Harry is far below the sixth-grade norm 
in reading and that Sue is below the norm in arithmetic. At the 
same time, Mary reads at eighth-grade level, and John (the 
youngest child in the class) is up to the ninth-grade norm in
science. Individual differences like these are the rule rather than
the exception in most elementary classes. It is fairly easy for Miss
Clark to prescribe further reading for Mary, and to stimulate John
to carry out an individual project in science (for example, classifying
the birds in the local community). The below-average children
often present real problems, and as a result they are given
more of the teacher’s time and effort than the bright children.
If the extra time which Miss Clark can devote to Harry and Sue
is insufficient to bring these children up to the sixth-grade levels
in reading and arithmetic, they should be referred to special
classes, if such are available. The larger the number of below- 
average children, the more difficult is Miss Clark’s task, and the 
more likely she is to neglect the bright children. 

It should be noted, as a further point, that the printed norm 
(local or national) does not necessarily establish the optimum 
level of performance for every pupil in the sixth (or any other)
grade. If Norman, whose IQ is 120, is just on the sixth-grade
norm in reading and arithmetic, he is not performing up to ex- 
pectation—his scores should be above the norm for his grade. On 
the other hand, if Bill, whose IQ is a modest 94, is at or above 
the norm for the sixth grade in reading and arithmetic, he is 
actually doing better than we can reasonably expect of him. The 
intelligence of the child must always be considered in deciding 
whether his school work is “normal” for the grade. 

Sometimes Miss Clark will suspect from a pupil’s sullen behavior,
or open aggressiveness, or his tendency to whimper at
the slightest provocation that emotional factors are causing or 
contributing to his difficulties in school. Such a pupil should be 
referred to the school psychologist (if there is one) or to the 
school physician. The clinical psychologist is often able through 
tests and interviews to get a clearer idea of a pupil's difficulties 
than can the teacher. The teacher should visit a child’s home if 
she suspects that parents and home environment are involved, as
they often are. Corrective measures (when possible) can be more
intelligently applied when causal factors making for undesirable 
conduct and/or poor school work are known, rather than sur- 
mised from superficial impressions. 


Prediction. Whether it will be profitable for a student to take 
science or mathematics in high school or college can be forecast 
with considerable assurance from his performance on standard 
tests. Prediction of later success is usually improved when tests 
given in elementary schools are combined with a good intelli- 
gence test. Intelligence and achievement tests are regularly 
utilized in many schools in the selection and placement of students
in courses of study. The combination of achievement tests
and special aptitude tests is valuable in predicting a student’s 
success in a professional school—in law or medicine, for example. 


ACHIEVEMENT TESTS IN SPECIAL 
SUBJECT AREAS 


In the preceding section, we described five general achieve- 
ment batteries designed to assess academic standing in school. In 
the present section, we shall consider several representative sub- 
ject-matter achievement tests. These include tests of reading and 
arithmetic, as well as tests planned to determine mental maturity 
(readiness) and proficiency in special subjects. Of the various 
subject-matter tests, those in reading and arithmetic are most 
often given, since they represent fundamental skills upon which 
school achievement largely depends. Subject-matter tests are
found, of course, in the general achievement batteries, as well as
in separate form. The tests listed below were selected as being 
typical of a very large number available. 


1. Metropolitan Readiness Tests
2. Iowa Silent Reading Tests
3. Co-operative Mathematical Tests
4. Evaluation and Adjustment Series
5. Co-operative French Test (elementary)
6. Co-operative Science Test


Metropolitan Readiness Tests* 


Description. The primary objective of these tests is to find 
whether a child is sufficiently mature to undertake the study of 
reading. But the tests are concerned also with “readiness” for 
arithmetic, and with general physical and mental maturity. The 
six tests in the battery may be described as follows: 


(1) Word Meaning: child selects picture named by the examiner.

(2) Sentences: same as (1) except that the examiner uses
sentences and phrases instead of single words.

(3) Information: child marks the picture corresponding to the
examiner’s oral description.

(4) Matching: child must recognize similarities and differences
in pictures, geometrical forms, numbers, letters, words.

(5) Numbers: child must demonstrate a knowledge of number
concepts and carry out simple operations.

(6) Copying: child is required to copy simple graphic forms,
as well as numbers and letters.


* Published by the World Book Company, Yonkers, N. Y.


All the test items are pictorial—that is, non-verbal. The test 
has two forms. The battery is essentially a prognostic test: its 
purpose is to forecast a child’s mental, sensory-motor, and mus- 
cular readiness for first-grade work. Figure 5-3 shows sample 
items. 


Scope. The test is for the end of kindergarten and the beginning
of first grade. The test requires about sixty minutes working
time.


Scoring. Norms in percentile ranks allow the teacher to estimate 
a pupil’s readiness for reading (based on tests 1-4), readiness for 
arithmetic (test 5), and general maturity for first-grade work 
(tests 1-6). In addition, a child’s score is given a rating from A to 
E. An A rating denotes an excellent risk, the other letters a lesser 
degree of certainty down to E, which implies almost certain
failure.


Prognostic Value of the Metropolitan Readiness Tests. The test 
battery as a whole forecasts general maturity for the first grade,
but its sub-tests may be used diagnostically to provide informa- 
tion about individual children. If Ben makes low scores on tests 1, 
2, 3, and perhaps 4, for example, he has inadequate maturity in 
language for first-grade work. Or he has too little experience 
with and comprehension of language generally. If Louise earns 
low scores on tests 4 and 6, she is probably too immature to undertake
written work. As these two tests measure visual perception
and hand-eye co-ordination, an eye examination and training in
motor skills may be indicated. Test 5 (numbers) shows readiness
for number work, and the child who scores high should be able
to use numerical symbols. Test 6 (copying) has proved to be a
good measure of physical and mental maturity. From this test,
the teacher can pick up tendencies to reversals in drawing and
writing, phenomena fairly common at this age level. If a child
has not developed reading readiness by age 7½, he should be
examined by a physician, an oculist, and perhaps a psychologist.


FIGURE 5-3 Sample Items from Metropolitan Readiness Tests

[Pictorial items not reproduced.]

Test 1. Word Meaning. In the first row, the child marks the baby;
in the second row, the house.

Test 4. Matching. In each row the child circles the picture identical
to the one in the circular frame.

Reproduced by permission of the World Book Company.

Iowa Silent Reading Tests*


Description. This test consists of two batteries, one for elemen- 
tary schools and one for high schools and colleges. Both batteries 
measure reading rate, vocabulary, sentence comprehension, 
paragraph reading, and skill in locating information. Speed is an 
element in the battery, as well as power. The Elementary Test 
includes a reading comprehension test called “directed reading,” 
and the Advanced Test a test of poetry comprehension. 


Scope. The two batteries cover the following range: 


Elementary Test (four forms): grades 4-8
Advanced Test (four forms): high school and college freshmen


Working time for either battery is about 50 minutes.


Scoring. There are six sub-tests in the Elementary Test:


1. Rate and comprehension in reading connected prose.

2. Directed reading of prose to get answers.

3. Vocabulary and word meaning.

4. Paragraph reading: selecting the main idea and adding
appropriate details.

5. Sentence meaning: understanding brief sentences out of
context.

6. Work-study skills: alphabetizing and using an index.


Test 1 yields two scores (rate and comprehension) and Test 6 
two scores (alphabetizing and use of index). Tests 2, 3, 4, and 5 
yield one score each. These 8 sub-scores are converted into scaled 
scores by means of tables appended to each test. Scaled scores 
may be plotted on a profile to show the variations in perform- 
ance. Percentile norms are also provided by grade for each sub- 
test and for total score. There are age and grade equivalents to 
total score. 

The Iowa Test can be expected to spot (a) the extremely slow
reader, (b) the careless reader who fails to follow directions,
omits necessary details, and skims over important facts, and (c)
the rapid but uncomprehending reader.


* Published by the World Book Company, Yonkers, N. Y.


Co-operative Mathematics Test for Grades 7, 8, and 9* 


Description. This test consists of four parts: I, skills; II, facts, 
terms, and concepts; III, applications; IV, appreciation. Ques- 
tions and problems cover basic arithmetic as well as simple algebra 
and geometry. The test may be used for survey purposes, but it 
is perhaps more valuable in evaluation and guidance. Sample items 
from the test are shown in Figure 5-4. 


Evaluation and Adjustment Series (High School)** 


Description. This is an extensive battery of subject-matter and 
other tests (twenty-four so far and more to be added) designed 
for use in high schools. The tests cover such traditional areas 
as algebra, biology, geometry, physics, history, and literature. In 
addition, there are tests of reading comprehension, “problems in 
democracy," health knowledge, and study skills. The content of 
the tests has been drawn from standard textbooks, courses of 
study, and professional literature. Tests may be administered as 
separates or as parts of a general survey. 


Scope. For survey and diagnosis in grades 9 through 12. There 
are two forms for most tests. 


Scoring. Raw scores are converted into scaled scores for each
test, so that comparisons may be made from test to test. Results
may also be compared graphically by means of a profile. Many
of the tests provide charts showing what score is to be expected
at given IQ levels. IQ's are from the Terman-McNemar Test of
Mental Ability. The reliability of the various tests in the battery
is satisfactory. The separate tests require from 45 minutes to an
hour of working time.


* Published by the Educational Testing Service, Princeton, N. J. 
** Published by the World Book Company, Yonkers, N. Y. 
FIGURE 5-4 Sample Multiple-Choice Items from the Cooperative
Mathematics Test for Grades 7, 8, and 9


From Part I, Skills:

39. √16 equals
39-1 [illegible]
39-2 [illegible]
39-3 8
39-4 4
39-5 32


From Part II, Facts, Terms, and Concepts:

7. Which of the following is a unit in the
metric system?
Ounce
Centimeter
Yard
Bushel


From Part III, Applications:

24. If a man spends 12% of his salary on bonds,
and buys a $37.50 bond each month, what
is his monthly salary?
24-1 $312.50
24-2 $312.60
24-3 $350
24-4 $376.20
24-5 $450


From Part IV, Appreciation:

20. Which of the following has no volume?
20-1 Cylinder
20-2 Cone
20-3 Square
20-4 Cup
20-5 Rectangular box


Reproduced by permission of the Educational Testing Service.


Co-operative French Test (Elementary)* 


Description. The specifications for this test call for knowledge
of French grammar and vocabulary, plus the ability to use the
language in reading and translation. The test has three parts:
vocabulary, grammar, and reading. The vocabulary section is a
multiple-choice test of fifty words. Grammar (thirty-five items)
requires the selection of one of five choices to complete correctly
the translation of an English sentence into French. In the reading
section, forty incomplete sentences in French are to be completed
from a list of five options. The reliability of the test is high.


* Published by the Educational Testing Service, Princeton, N. J.


FIGURE 5-5 Sample Multiple-Choice Items from the Cooperative
Science Test for Grades 7, 8, and 9


From Part I, Informational Background:

3. It is believed that dinosaurs lost out in
their struggle for existence chiefly because
3-1 they were killed by man for food.
3-2 man could not tame them.
3-3 they were not adapted to changes
that took place in the earth's surface
and climate.
3-4 they were not fitted to eat plant food.
3-5 they had no brains.


From Part II, Terms and Concepts:

2. The instrument used to look at and study
the surface of the moon and the planets is
the
2-1 galvanoscope.
2-2 microscope.
2-3 telescope.
2-4 electroscope.
2-5 radiometer.

13. If two plants of the same species but of
different varieties are mated, the offspring
are called
13-1 mongrels.
13-2 sports.
13-3 biennials.
13-4 lentils.
13-5 hybrids.


Part III, Comprehension and Interpretation:
This test consists of multiple-choice items to be answered after reading
a paragraph of scientific prose or examining a table. The selection must
be understood and interpreted.


Reproduced by permission of the Educational Testing Service.


Scope. This test is intended for the first two years of high
school or for the first year of college study of French.


Scoring. Scaled scores are provided for each of the three parts 
of the test and for the total. There are percentile norms for high-school
and for college classes. Working time for the test is forty
minutes.


Co-operative Science Test (Grades 7, 8, and 9)* 


Description. There are three parts to this test: Part I, information
and background; Part II, terms and concepts; Part III, comprehension
and interpretation. The test is planned to measure
knowledge and application. Part II is in multiple-choice form. 
Part III consists of readings in science, each reading followed by 
questions designed to assess the student's understanding, as well 
as his ability to interpret and apply what he has read. (Figure 5-5) 


Scope. Grade 9 and superior seventh and eighth graders. 


Scoring. There are scaled scores for the three parts and for the 
total. Percentile norms are given for grades 7, 8, and 9. The 
working time for the whole test is about eighty minutes. Re- 
liability of the whole test is high. 


WHAT TO LOOK FOR IN AN EDUCATIONAL 
ACHIEVEMENT TEST 


The suitability of an educational achievement test for a given 
situation must be determined from an examination of its validity, 
its reliability, its scaling techniques, and its norms. The cost, 
time, and personnel needed to administer and score the tests must 
also be considered. These same requirements apply to group tests 
of intelligence. Each of the main characteristics of a mental test, 
except perhaps validity, has been commented on at appropriate 
places throughout this chapter. A summary of the relevant data 
under each category will now be offered. 


* Published by the Educational Testing Service, Princeton, N. J.


Validity. An educational achievement test is valid when it 
measures what it undertakes to measure. Most subject-matter 
tests possess content validity. An arithmetic test or a geography 
test or a reading test, for example, is valid by definition when it 
contains a sampling of arithmetic problems, geography questions, 
and paragraphs to be read. The standardized educational test is
made up of items taken from a variety of sources: widely used
textbooks, courses of study, examination questions, and outlines.
The items in tentative form are checked by experienced teachers
and are put into objective form by test construction specialists. A
broad selection of items insures a comprehensive sampling of 
materials. 

One validation technique employed in some educational tests 
is the following. The test is provisionally drawn up and is admin- 
istered to an experimental group; only those items are retained 
which show an increasing percentage passing with age or with 
grade. Other techniques of item analysis will be described in
Chapter 9. All of these procedures are directed toward selecting 
questions which will work together as a team, cover a wide range 
of difficulty, and be related closely in content (be homogeneous). 
The standard test, when finally made, is a compact and closely 
knit instrument for measuring what it purports to measure. Data 
on validation procedures will be found in most Manuals which 
accompany standardized achievement tests. 
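
The retention rule just described can be sketched in a few lines. The percentages below are invented for illustration, and the requirement of a strict gain at every grade is an assumption of this sketch; a test author might well tolerate small reversals in a trial sample.

```python
def increases_with_grade(percent_passing: list) -> bool:
    """True when each grade passes the item more often than the grade before it."""
    pairs = zip(percent_passing, percent_passing[1:])
    return all(later > earlier for earlier, later in pairs)


def retain_items(trial_results: dict) -> list:
    """Keep only the items whose percentage passing rises from grade to grade."""
    return [item for item, pcts in trial_results.items() if increases_with_grade(pcts)]


# Hypothetical percentages passing in grades 4, 5, and 6 of an experimental group.
trial_results = {
    "item 1": [22, 41, 63],
    "item 2": [55, 52, 58],  # dips in grade 5, so it is dropped
    "item 3": [10, 30, 30],  # no gain from grade 5 to grade 6, so it is dropped
}
retained = retain_items(trial_results)
print(retained)
```

Only the first item survives here, since it alone is passed by a larger share of pupils at each successive grade.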


Reliability. The reliability of the educational achievement tests 
described in this chapter has been generally reported as high. 
This means that parallel forms of the test correlate highly (over 
.90 in most cases) so that we may have confidence in the stability
of a child’s score. In most test Manuals, reliability is expressed by 
the “reliability coefficient,” also called the self-correlation of the 
test, or by the standard error of an obtained score. The correla- 
tion of a test with itself (by retest) or between alternate forms of 
the same test tells us how closely the pupils’ scores “stay put.” 
The standard error of a score tells us how much fluctuation to
expect in a child’s score upon retest. If the standard error is three
points, for example, the odds are two to one that Bill's score of 64 
will, on a second trial on the test, vary up or down from the first 
determination by not more than three points. The smaller the SE 
of a test score, the greater the stability of the obtained score. The 
SE of a test score gives us more information concerning reliability 
than does the reliability coefficient alone (page 29). 
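
The two indices can be illustrated with a short sketch. It assumes the usual textbook relation in which the standard error of a score equals the standard deviation of the test multiplied by the square root of one minus the reliability coefficient; the scores on the two forms are invented, and the function names are hypothetical.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Product-moment correlation between two sets of paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standard_error_of_score(sd: float, reliability: float) -> float:
    """SE of an obtained score, assuming SE = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(score: float, se: float) -> tuple:
    """Range of one SE on either side of the obtained score."""
    return (score - se, score + se)


# Hypothetical scores of six pupils on two parallel forms of the same test.
form_a = [52, 61, 64, 70, 75, 58]
form_b = [50, 63, 66, 69, 77, 57]
r = pearson_r(form_a, form_b)
print(f"reliability coefficient: {r:.2f}")

# A test with SD 10 and reliability .91 has an SE of about 3 points.
se = standard_error_of_score(sd=10, reliability=0.91)
low, high = score_band(64, 3)
print(f"SE = {se:.1f}; Bill's score of 64 is expected to fall between {low} and {high}")
```

The band of 61 to 67 is the range within which, at odds of about two to one, Bill's score should fall on a second trial.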

Scores obtained on most standard tests are highly stable, but
part scores based on a relatively few items are variable and may
be quite unreliable. Conclusions as to strengths and weaknesses 
based on unstable scores are always tentative, and must be re- 
garded as suggestive only. 


Scaling. Most educational achievement tests are first scored in 
arbitrarily assigned points, so many points being given for a 
correct answer. These point scores are usually converted into 
scaled scores by means of tables printed at the end of the sub-test.
The meaning of standard scores and of T-scores has been
discussed in Chapter 2. Raw or obtained scores (point scores)
from the sub-tests of a battery differ in length, difficulty, and
content; they cannot be compared or combined as they stand. 
When scaled, scores expressed in different units are comparable. 
Scaled scores—and sometimes raw scores—are usually converted 
into age and/or grade equivalents—into the age and grade values 
which correspond on an average to the given scores. If the 
average child of 9 years and 4 months earns a score of 38 on an 
Arithmetic Fundamentals Test, then the score of 38 “equals” an 
educational age (EA) of 9-4. If children who are half way 
through the seventh grade (that is, at 7.5) earn a mean score of 
63 on a Reading Test, the score of 63 has a grade equivalent of
7.5.
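
Why scaling makes sub-tests comparable may be seen from the following sketch, which turns raw scores into T-scores, taken here in the common sense of standard scores with a mean of 50 and a standard deviation of 10. The raw scores are invented, and the mean and standard deviation are computed from the small group itself. That is an assumption made for brevity: the scaled scores of a published test are read from tables built on its own norm group.

```python
import statistics


def t_scores(raw_scores: list) -> list:
    """Convert raw scores to T-scores (mean 50, SD 10) within one group."""
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)
    return [round(50 + 10 * (x - mean) / sd, 1) for x in raw_scores]


# Hypothetical raw scores of the same three pupils on two sub-tests
# that differ in length and difficulty.
arithmetic_raw = [20, 30, 40]
reading_raw = [50, 60, 70]

arithmetic_t = t_scores(arithmetic_raw)
reading_t = t_scores(reading_raw)
print(arithmetic_t)
print(reading_t)
```

Although the raw scores on the two sub-tests look quite different, each pupil holds the same relative position on both, and the scaled scores show this directly.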
The educational age (EA) may be divided by the chronological
age (CA) to give an educational quotient (EQ): EQ = EA/CA.
This EQ is a measure of acceleration and is somewhat analogous
to the IQ. The EA and EQ are often useful, provided they are
taken to refer only to the tests on which they are based and are 
not thought of as general indices. 
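
A minimal sketch of the calculation follows. Multiplying the ratio by 100 is an assumption made here by analogy with the IQ; the chronological ages are hypothetical, and the educational age of 9-4 is the one used in the example above.

```python
def to_months(age: str) -> int:
    """Convert an age written as 'years-months' (for example '9-4') to months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def educational_quotient(ea: str, ca: str, scale: int = 100) -> int:
    """EQ = EA / CA, multiplied by `scale` (100 assumed, as with the IQ)."""
    return round(scale * to_months(ea) / to_months(ca))


# An EA of 9-4 earned by three hypothetical children of different ages.
eq_younger = educational_quotient("9-4", "8-9")   # younger than his EA
eq_same = educational_quotient("9-4", "9-4")      # exactly at age level
eq_older = educational_quotient("9-4", "10-0")    # older than his EA
print(eq_younger, eq_same, eq_older)
```

On this convention a quotient above 100 marks a child whose achievement on the test runs ahead of his age, and a quotient below 100 one whose achievement lags behind it.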


Norms. Norms are typical measures of performance. In a 
standard educational test, the mean score made by a large and 
representative group of fifth-grade pupils is the norm for fifth- 
grade children on this test. Norms are expressed in age and grade 
equivalents, as percentile ranks, and in the form of scaled scores. 
A child’s grade placement is found by computing the tenths of 
the school year which have passed before the test was given. If 
the school year begins about September 1 and ends June 15, a 
sixth-grade class tested in the period between March 16 and April 
15 is assigned the grade position of 6.7—the class is 7/10 into the 
school year. Most standard educational achievement tests report 
nation-wide norms in their Manuals. These typical performances 
are based on the achievements of large groups of children from 
all over the country. As we have pointed out, local norms (for 
city or state or both) are often better measures of pupil achievement. 
Any pupil’s scores relative to those of other pupils should 
be evaluated in terms of his effort, his intelligence, and his home 
and community. 
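
The grade-placement rule can be sketched as follows. The text fixes only one point of the schedule (March 16 to April 15 gives .7); the sketch assumes that every tenth runs from the 16th of one month to the 15th of the next, that September 1 to 15 counts as .0, and that dates after June 15 lie outside the school year. The function name and the year are hypothetical.

```python
from datetime import date

def grade_placement(grade, test_date):
    """Grade placement in tenths of a school year running Sept. 1 to June 15."""
    # Whole months elapsed since September (Sept = 0, Oct = 1, ..., Aug = 11).
    offset = (test_date.month - 9) % 12
    # Assumed schedule: a new tenth begins on the 16th of each month.
    tenths = offset + (1 if test_date.day >= 16 else 0)
    if tenths > 9:
        raise ValueError("date falls outside the September 1 to June 15 school year")
    return round(grade + tenths / 10, 1)

# A sixth-grade class tested on March 20 of a hypothetical year.
print(grade_placement(6, date(1959, 3, 20)))
```

Under these assumptions the sixth-grade class of the example is placed at 6.7, in agreement with the text.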


Other Factors in the Selection of a Test. The cost of a testing pro- 
gram, the personnel required, and the time it will take from other 
school activities: all these must be considered in adopting a given 
test or tests. Tests which fit easily into a class period, which can 
be scored objectively (by means of stencils) by a clerk, and 
which are acceptable in form and content to teachers and to 
parents are in general least disruptive of the school’s routine. 


SUGGESTIONS FOR FURTHER READING 
Greene, H. A., Jorgensen, A. N., and Gerberich, J. R. Measurement 
and Evaluation in the Elementary School (2nd edition). New York: 
Longmans, Green, 1953. 

Jordan, A. M. Measurement in Education. New York: McGraw-Hill, 
1953. 

Traxler, A. E. et al. Introduction to Testing and the Use of Test Results 
in Public Schools. New York: Harper, 1953. 


SUGGESTIONS FOR LABORATORY WORK 


1. Administer two or three standardized achievement tests to the 
class, cutting the time to one-half if necessary. Have students score 
their own tests and plot profiles where called for. 

2. Analyze a standard reading test, listing the objectives which you 
think the author had in mind. Do you agree that these objectives were 
fulfilled? 

3. Select a test taken in (1). Consult the Manual for data on validity, 
reliability, scaling procedures, and norms. 


QUESTIONS FOR DISCUSSION 


1. For which of the following purposes would a standardized achieve- 
ment test be useful: 
(1) To discover which pupils have not mastered multiplication and 
division of fractions. 
(2) To determine which pupils are reading too slowly. 
(3) To determine for the class which punctuation skills need further 
work. 
(4) To section the class into two groups for teaching arithmetic. 
(5) To discover the subjects in which each pupil is strong and in 
which weak. 
2. A teacher lists the following as objectives of a course in history and 
civics: 
(1) To present facts in the field. 
(2) To prepare the class for the duties of citizenship. 
(3) To further appreciation of democracy. 
(4) To foster criticism of governmental processes. 
(5) To aid pupils in thinking about problems in government. 
Which of these objectives is the teacher most likely to fulfill? 
3. The Manual of Test ABC states that the test may be used for 
diagnostic purposes. What do you look for in a test to determine whether 
it has diagnostic value? 

4. A teacher of English states that batteries of standard tests tell the 
English teacher nothing that cannot be better found out from a theme 
and an interview. Do you agree? 

5. Why is it necessary that sub-tests in a battery to be used for diagnosis 
have high reliability? 

6. In some schools, teachers prepare for a testing program by having 
students review older standard examinations. What effect could this 
have on the students’ morale? On the comparability of test results from 
school to school? Is it good educational practice? 

7. In School A, the pupils in grades 4 to 7 are given the California 
Achievement Tests. Scores are recorded in grade equivalents only. What 
other types of scores would be valuable? Why? 

8. The Manual of a reading test reports a correlation of .40 with 
English marks in the first year of high school. Is this good evidence of 
validity? Discuss. 

9. Suppose that the Metropolitan Achievement Tests have been admin- 
istered in grade 5 in October. How might you, as the teacher, use the 
results of the test? 

10. For what predictive purposes would it be desirable to have the 
results from the following tests: 
(1) A test of ability to read difficult scientific prose drawn from 
various fields. 
(2) A test of skill in grammar: punctuation, capitalization, sentence 
structure, and so on. 

11. How could you use the results from a group intelligence test to 
supplement scores made by your pupils on an achievement battery? 

12. Is it important to have tests of speed, as well as of power? 


CHAPTER 6 


APTITUDE TESTS 


When a youngster possesses traits and abilities which enable 
him to speak French readily, acquire mathematics, work 
with tools, or play a musical instrument well, he is said to have 
aptitude for the given activity. Aptitudes are probably inherited 
basically, but they cannot appear unless the environment is 
favorable, that is, unless the opportunity is provided. Very often 
some training, often a great deal of it, is necessary, too, before an 
aptitude reveals itself in performance. 

Aptitude tests are not essentially different in form or in content 
from intelligence and educational achievement tests, since 
all mental tests are in reality measures of aptitude. Intelligence 
tests measure capacity for school work and for vocations requiring 
school training; and achievement tests measure proficiency 
in English grammar, mathematics, science, and other subjects. 
Perhaps the chief difference between these tests and those de- 
signed to measure aptitudes is the fact that an aptitude test is 
concerned almost entirely with the future, that is, with prognosis. Thus 
an engineering aptitude test is used typically to forecast an ex- 
aminee’s chances of success in engineering. The aptitude test 
alone is, of course, rarely able to provide a wholly satisfactory 
estimate of probable performance later on. For an individual’s 
efforts to be maximally effective, aptitude must be supple- 
mented by training. Furthermore, the examinee must possess 
initiative, interest in the job, and favorable personality charac- 
teristics. 

We have classified aptitude tests under four heads: (1) general, 
(2) special, (3) professional, and (4) talent. The two best-known 
general aptitude batteries are those designed to assess aptitude 
for (a) mechanical tasks, and (b) for clerical work. Many special 
tests (of speed, co-ordination and reaction time) have been de- 
vised to measure aptitudes believed to be crucial in industry. 
Achievement tests, too, are employed as aptitude tests to reveal 
an examinee's performance in languages or mathematics, for 
example, and hence provide a measure of his promise in more advanced 
courses. In the field of professional work, aptitude test 
batteries have been assembled to assess the traits believed necessary 
for success in medical school, in law, in engineering, 
and in teaching. Aptitude in music and art is generally called 
talent, and tests are available to forecast achievement in these 
fields. 

GENERAL APTITUDE BATTERIES 


The general aptitude battery attempts to forecast probable 
success in a number of related tasks or vocations by sampling a 
wide range of behaviors believed to be involved in the activity. 
In this section, two batteries designed to measure aptitude for 
mechanical work are described, together with two batteries 
planned to measure aptitude for clerical proficiency. 


Mechanical Aptitude 


The term “mechanical aptitude” includes a variety of behav- 
iors. One of the earliest mechanical aptitude tests consisted of 
a box containing a number of common gadgets in separate com- 
partments. Each of these contrivances (a lock, door bell, clothes 
pin, and so on) was to be assembled with the aid of simple tools. 
The score was determined by the speed and accuracy of assem- 
bly. This kind of test is often described as a “job sample” or 
“vocational miniature,” since it involves what has to be done on 
a small scale. Among the sub-tests in paper-and-pencil batteries 
devised to measure mechanical aptitude are (1) tests requiring 
motor speed and dexterity of movement, (2) tests of the ability 
to visualize or perceive mechanical and spatial relations (im- 
portant in reading blueprints and in architectural drawings); (3) 
tests of mechanical information concerning tools, machines, and 
the construction and use of various contrivances; and (4) tests 
of mechanical reasoning as demonstrated in the ability to solve 
problems dealing with tools, pulleys, levers, machine parts, and 
the like. In addition, in assessing mechanical aptitude, inventories 
are used which are designed to reveal interest in mechanical 
things. Such interest may be shown, for example, when a boy 
reads Popular Science avidly, has his own tools, tinkers with 
radios, and builds space machines. One of the most useful find- 
ings to come out of the testing program in World War II was 
the discovery that paper-and-pencil tests of mechanical aptitude 
are as predictive of success in many mechanical jobs as are actual 
job samples covering the work. 

The following two test batteries are representative of the best 
tests in this field: 


MacQuarrie Test of Mechanical Ability 
Bennett Mechanical Comprehension Test 
MacQuarrie Test of Mechanical Ability* 


Description. This battery consists of seven paper-and-pencil 
tests, as follows: 


1. Tracing: following a narrow path. 
2. Tapping: making dots rapidly. 
3. Dotting: placing dots precisely. 
4. Copying: making a figure from co-ordinates. 
5. Location: locating items by co-ordinates. 
6. Block Counting: counting hidden blocks in a stack. 
7. Pursuit: tracing a line through a tangled pattern. 


Sample items from the MacQuarrie tests are shown in Figure 6-1. 

All these tests are relatively simple and all are speeded: testing 
times are short. The MacQuarrie tests are designed to measure 
hand-eye co-ordination, finger movement and speed, manual dexterity, 
visual acuity, and spatial perception of direction and size. 
Taken as a whole, the MacQuarrie battery measures motor dexterity 
at a fairly low level of difficulty rather than aptitude for 
engineering or for architecture. For the latter, the Bennett Test 
of mechanical comprehension is recommended. Some of the 
MacQuarrie sub-tests are predictive of special tasks: the tests in 
tracing, dotting and pursuit, for example, measure aptitude for 
typing; and tests of block counting, tracing, pursuit, location, and 
copying are related to performance in mechanical drawing and 
the reading of blueprints. The Manual which accompanies the 
MacQuarrie advises the use of sub-test patterns for predicting 
success in various jobs. 


Scope. The MacQuarrie test can be administered from grade 7 
on. It has been employed chiefly in the prediction of success in 
factory and other manual-manipulative work. 


Scoring. Percentile norms are available for the sub-tests and for 
total score. The working time for the whole test is about twenty 
minutes. Since some of the tests in the battery are allotted only 
ten to twenty seconds, a stop watch is needed in order to time 
the tests accurately. The reliability of the whole test is high. 
Reliabilities of the seven sub-tests are lower, but are fairly satisfactory 
for such short tests. 


FIGURE 6-1 Sample Items from the MacQuarrie Test of Mechanical Ability 

Copying: Copy figure by joining dots. 
Blocks: How many blocks touch each block with an X on it? 
Pursuit: Follow each line by eye and show where it ends, by 
writing its number in the correct box at the right. 

Reproduced by permission of the California Test Bureau. 


Bennett Mechanical Comprehension Test* 


* Published by The Psychological Corporation, New York, N. Y. 


Description. This is a paper-and-pencil test in which comprehension 
of mechanical relations is determined by means of pictures 
and sketches. The test is fairly advanced in difficulty. Each 
picture or drawing has a simply phrased question designed to 
reveal the examinee’s understanding of the mechanical problem 
presented. Figure 6-2 shows samples from the test battery. 


Scope. There are four forms of the Bennett test. Form AA, the 
easiest, is suitable for trade and high schools and for less well 
trained workers. Form BB, more difficult, is for engineering 
school applicants, technicians, and engineers. Form CC, the most 
difficult, differentiates among examinees of high ability levels. 
The fourth form, W1, is for women. 


FIGURE 6-2 Samples from Bennett Mechanical Comprehension Test 

Which room has more of an echo? 
Which would be better shears for cutting metal? 
Which gear turns slower? 
Which cart is more likely to tip over on the hillside? 

Reproduced by permission of The Psychological Corporation. 


Scoring. Percentile norms, which are supplied for each test 
form, are applicable to a variety of student and occupational 
groups. The test is valuable in guidance, in selecting applicants 
with aptitude for mechanical thinking, and in the selection of 
students wanting to study mechanics and engineering. The Mac- 
Quarrie is a useful supplement to the Bennett test when speed 
and manual dexterity are required as well as more abstract think- 
ing about mechanical relations. 

The reliability of the Bennett is satisfactory. Validity is hard 
to determine, but the test is valid in relation to such criteria as 
grades in high-school shop courses and occupational and in- 
dustrial performance. 


Clerical Aptitude 


Tests planned to gauge clerical aptitude are concerned mainly 
with perceptual speed and accuracy in reading, writing, and 
marking, and with manual dexterity and skill. Office workers 
are designated in several ways, such as general clerk, sales clerk, 
shipping clerk, filing clerk, typist, and receptionist. The jobs 
differ in the kind and variety of their duties, but all demand (to 
a greater or lesser extent) reading, writing, sorting, checking, 
filing, folding, sealing, and stamping. 

The present section will describe two tests of clerical aptitude, 
the first fairly narrow in functions covered, the second 
much broader. 


Minnesota Clerical Test 
General Clerical Test 


Minnesota Clerical Test* 

* Published by The Psychological Corporation, New York, N. Y. 


Description. This battery covers speed and accuracy in perceiving 
clerical detail. There are two parts, number comparison 
and name comparison. In the first, the examinee is shown two 
hundred pairs of numbers each containing from 3 to 12 digits. 
If the two numbers are alike, the examinee places a check (✓) 
between them; if they are unlike, he leaves the space blank. In the 
second test, proper names (which match or fail to match) are 
substituted for number pairs. Samples are shown below: 


79542 ______ 79524 
5794367 __✓__ 5794367 
John C. Linder ______ John C. Lender 
Investors’ Syndicate __✓__ Investors’ Syndicate 


The Minnesota Clerical Test is not designed to encompass all 
the factors which make for proficiency in office work, but it 
does attempt to predict ability to handle addresses, bills, accounts, 
and so on. The Minnesota test has been found to have prognostic 
value in the selection of clerks, packers, checkers, inspectors (of 
products), and other factory jobs. 

Scope. This clerical test may be used with students from junior 
high school on and for adults. 


Scoring. The working time of the test is about fifteen minutes, 
so that both speed and accuracy enter into a score. Individual 
differences appear in the scores and must be taken into account 
in interpreting the test. A very careful examinee, for example, 
may make few errors but earn a relatively low score because of 
slowness and over-cautiousness. On the other hand, a fast but 
careless worker may mark more items but tend to make many 
errors. Percentile norms are available for boys and girls, junior 
and senior high-school students, and several groups of industrial 
workers. Among the latter there are norms for women who are 
machine operators, typists and clerks; for men who are tellers 
(bank), accountants and various sorts of clerks. A high score 
earned by a student does not necessarily mean that this examinee 
will make a good clerical worker, though it is a decidedly good 
omen. On the other hand, a high-school counselor would certainly 
be wise to question the vocational promise of a commercial 
and business student who scored below the twenty-fifth 
percentile of clerical workers. The reliability of the test is high. 
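
The counselor's rule of thumb can be illustrated by a short sketch. The norm-group scores are invented for the purpose, and the definition of percentile rank used here (per cent of the norm group scoring below, plus half of those earning the same score) is one common convention assumed for illustration; published norms tables should be used in practice.

```python
def percentile_rank(score, norm_scores):
    """Per cent of the norm group below the score, plus half of those at it."""
    below = sum(s < score for s in norm_scores)
    equal = sum(s == score for s in norm_scores)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical scores of twenty employed clerical workers (the norm group).
norm_group = [88, 95, 101, 104, 110, 113, 117, 120, 126, 131,
              135, 138, 142, 147, 150, 154, 159, 163, 170, 178]

student_score = 104
pr = percentile_rank(student_score, norm_group)
needs_review = pr < 25  # below the twenty-fifth percentile of clerical workers
print(pr, needs_review)
```

A student whose score falls in this lowest quarter is not thereby excluded from clerical work; the result is simply a signal that his vocational plans deserve a closer look.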


General Clerical Test (GCT)* 


Description. This test battery is designed to measure three kinds 
of aptitude judged to be valuable in office work. There are nine 
sub-tests in the battery. Parts I and II test clerical speed and 
accuracy; Parts III, IV, and V numerical ability; Parts VI, VII, 
VIII and IX verbal facility. The first two (checking and alpha- 
betizing) measure perceptual speed and accuracy as expressed in 
such activities as sorting, coding, and alphabetizing. The next 
three measure numerical aptitude as shown in computation, error 
location and arithmetical reasoning. The last four measure verbal 
facility by means of spelling, reading comprehension, vocabulary, 
and grammar. The over-all score is a good measure of 
abstract intelligence, as well as of aptitude for clerical work. 
The test is to be recommended, therefore, for clerical jobs which 
demand a relatively high level of intelligence. 


Scope. The battery is intended for use with high-school and 
business school students. The GCT may also be valuable when 
testing applicants for more responsible clerical positions. The 
working time for the test is about fifty minutes. 


Scoring. Percentile norms are available for high schools and for 
business schools, as well as for various sorts of clerical workers. 
Norms for each sub-test, as well as for total score, are provided. 
The reliability of the whole test is high—greater than .90. The 
reliability of the sub-tests is much lower, and the counselor must 
be tentative in judgments based upon parts of the test. 


* Published by The Psychological Corporation, New York, N. Y. 


APTITUDE TESTS IN SPECIAL AREAS* 


In this section five test batteries often useful to the educational 
counselor and classroom teacher will be described. These spe- 
cialized examinations are illustrative of many tests in this field: 


Differential Aptitude Tests 

Minnesota Paper Form Board 

Murphy-Durrell Diagnostic Reading Readiness Test 
Orleans Algebra Prognosis Test 

Turse Shorthand Aptitude Test 


Differential Aptitude Tests (DAT)** 


Description. This battery is designed for educational and vocational 
guidance of high-school students. There are seven sub-tests, 
each of which yields a separate score: 


Verbal reasoning: A difficult verbal analogies test, which measures ability 
to handle verbal relations. Aspirants for professions should earn high 
scores. 

Numerical ability: An arithmetic test covering a wide range of operations. 
This test is an important predictor in science and engineering. 

Abstract reasoning: A non-language test which demands the solution of 
problems expressed in diagrams and figures. The test measures a high 
level of abstract intelligence. 

Space relations: Ability to perceive a three-dimensional object from a 
two-dimensional pattern. Useful in engineering, architecture, and 
drafting. 

... Test. Useful as a predictor of engineering aptitude when combined 
with the first four tests above. 

Language usage: Two tests scored separately which measure the ability 
to spell and to locate errors in sentences. Emphasizes the mechanics 
of language as compared with test #1, which emphasizes abstract 
comprehension. 


* ... valuable in appraising the ...tional training and work experience of an applicant ... 
** Published ... 


FIGURE 6-3 Sample Items from the Differential Aptitude Tests 

MECHANICAL REASONING 
Which man in this picture has the heavier load? 

CLERICAL SPEED AND ACCURACY 
In each test item, one of the five combinations is underlined. 
Find the same combination on the answer sheet and mark it. 

LANGUAGE USAGE: I Spelling 
Indicate whether each word is spelled right or wrong. 
Examples: W. man; X. gurl 

LANGUAGE USAGE: II Sentences 
Decide which of the lettered parts of each sentence contains errors, 
if any, and mark the corresponding letters on the answer sheet. 
Example (parts lettered A to E): Ain't we / going to the / office / next week / at all. 

Reproduced by permission of The Psychological Corporation. 
The illustrative items in Figure 6-3 show the nature of the 
sub-tests. 

A main feature of the DAT is that the total score is broken 
down into several components, so that from a student’s profile 
we have a record of comparative performance in eight fundamental 
activities. The Manual gives explicit instructions for administering 
and scoring the test battery. In addition, a Casebook 
illustrates the use of the profile in diagnosis, and will be helpful 
to guidance counselors. 


Scope. For grade 8 and for high-school grades 9-12. 


Scoring. Percentile norms are supplied for grades 8 through 12 
for total score and for scores on each sub-test. Since there are 
large sex differences, percentile norms are given for boys and 
girls separately. Scaled scores (with a mean set at 50) are employed 
in plotting the profiles. Figure 6-4 shows the profile of 
a boy who could profit from educational counseling. Note that 
James is high in the space and mechanical tests, but mediocre to 
low in all the others. The boy is certainly not “verbally minded,” 
although he appears to have real talent in mechanics. The teacher 
will understand James better if he has his profile available. 

The DAT represents the modern practice of substituting a 
number of analytic scores (for example, on a profile) for a single 
over-all score. We have noted (page 113) that diagnosis of 
strong and weak points from short sub-tests is always precarious 
because of their low reliability. The reliability of the total DAT 
is very high, and the authors have increased the value of a diagnosis 
from the sub-tests by computing ... between sub-test scores which w... 
chance. This makes it possible to ... score in abstr... 
FIGURE 6-4 Profile of a High-School Boy on the Differential 
Aptitude Tests 

[Individual Report form, Differential Aptitude Tests, by G. K. Bennett, 
H. G. Seashore, and A. G. Wesman. The Psychological Corporation.] 
long (working time approximately three hours) and the cost 
relatively high. Good norms are available for the high-school 
grades (boys and girls taken separately), but there are relatively 
few data on occupational and vocational groups. The battery 
appears to have content validity, and various experimental studies 
indicate that it possesses empirical validity. For example, workers 
in the electrical, mechanical, and building trades score above 
average on mechanical reasoning; and clerks are about average 
in numerical ability and in clerical speed and accuracy and language 
usage. Engineering students score very high on all the sub-tests 
except the clerical tests, but are above the mean here. Men 
in the skilled trades (baker, butcher) are average in mechanical 
reasoning, and low in numerical ability, abstract reasoning, and 
space relations. Pre-medical students score high on all sub-tests, 
and especially high in verbal reasoning, numerical ability, and 
sentences. In the high school, verbal reasoning and sentences are 
predictive of grades in English; numerical ability, verbal reasoning, 
and abstract reasoning show substantial correlations with 
mathematics and science. Unfortunately, the data do not reveal 
how successful a man is likely to be over a period of time in 
a profession, occupation, or trade. But the tests often provide 
significant clues. 


Minnesota Paper Form Board (MPFB)* 


... paper-and-pencil test dealing ... effort to put a formboard 
on paper. Sample items are shown in Figure 6-5. ... figure cut into two or ... 
FIGURE 6-5 Sample Items from the Minnesota Paper Form 
Board Test 

Directions: For each item choose the figure which would result if the 
pieces in the first section were assembled. 

Reproduced by permission of The Psychological Corporation. 


course of action. Thus, we can tell a youngster that he had 
better not attempt engineering, but we cannot always offer him 
specific advice as to just what he should do. 


Scope. Grade 7 and above. 


Scoring. Norms are available for school grades and for various 
occupational groups. There are two forms of the MPFB, and 
the test is easy to administer and score. Counselors have found 
the test useful as a supplement to verbal intelligence and achieve- 
ment tests, especially for students planning to study architecture, 
engineering, commercial art, and other vocations requiring spatial 
perception and visualization. 

Murphy-Durrell Diagnostic Reading Readiness Test* 

Description. This test has been designed to measure three characteristics 
believed to be important in the acquisition of reading 
skills: auditory discrimination, visual discrimination, and learning 
rate. Like other readiness tests it is prognostic, in that it 
forecasts whether or not a child is ready to begin reading. It is 
also an achievement test and could be so classified, as it measures 
the educational maturity of a youngster. The Metropolitan Readiness 
Tests (page 118) may, in turn, be classified as aptitude tests 
rather than as achievement tests. 

* Published by the World Book Company, Yonkers, N. Y. 

The Murphy-Durrell test provides useful information for the 
first-grade teacher in deciding when to start a formal reading 
program and what outcomes to expect. At the same time, a good 
intelligence test will be useful in estimating general mental 
maturity. 

Scope. Early in the First Grade or before. 


Scoring. Test 1 and Test 2 (auditory discrimination and visual 
discrimination) require about an hour each. Test 3 (learning) is 
both an individual and a group test; there are twenty minutes for 
group instruction and three brief individual periods. Obtained 
raw scores are converted into percentile norms for Tests 1 and 2; 
ratings are used in Test 3. 


Orleans Algebra Prognosis Test (Rev.)* 


Description. This is a prognostic test, the purpose of which is to 
determine whether a pupil is likely to succeed in (is ready for) 
algebra. The test is administered before the pupil undertakes the 
study of algebra. There are nine Parts, consisting of simple 
lessons covering s... for example, use of 
symbols, substitution in equations ... algebra grades and achievement 
test scores in algebra. 

Scope. For students planning to study algebra. 


* Published by the World Book Company, Yonkers, N. Y. 

Scoring. The test requires about forty-five to fifty minutes. 
There are percentile norms corresponding to point scores. Fur- 
ther, there are expectancy charts predicting how well a child 
making a certain score can be expected to do in algebra. The 
reliability of the test is satisfactory. 


Turse Shorthand Aptitude Test* 


Description. This test is illustrative of aptitude tests developed 
for use with commercial and vocational subjects. The purpose of 
the test is to determine whether an examinee is likely to be 
successful in learning shorthand. There are seven sub-tests: strok- 
ing, spelling, phonetic association, symbol transcription, word 
discrimination, dictation, and word sense. 


Scope. For students planning to study shorthand. 


Scoring. There are percentile norms for students beginning the 
study of shorthand. The Turse test is correlated with achieve- 
ment in shorthand and is valuable in prognosis. The working 
time of the test is about an hour. 


APTITUDE TESTS FOR THE PROFESSIONS 


Tests of aptitude for the professions are primarily achievement 
tests designed to forecast a student’s chances of success in train- 
ing for medicine, law, or engineering. These tests are specialized 
in content and are essentially work samples in the designated 
field. Professional aptitude batteries are validated against grades 
in courses. It is not known precisely just how predictive these 
tests are of success in the actual practice of a profession, but 
there is some evidence that such aptitude tests are related—some- 
times highly related—to later success. 

The classroom teacher should be familiar with the general 
content and purpose of the professional aptitude tests, though 
he will rarely be called on to administer or score them. These 
batteries are not generally available, are often part of a testing 
program, are highly specialized, and are usually scored and 
interpreted in a testing center. We shall, accordingly, give a less 
detailed description of them. 

* Published by the World Book Company, Yonkers, N. Y. 


Medical College Admission Test* 
Description. This test consists of four parts: verbal, quantitative, 
understanding society, and science. The verbal section ... social, economic, 
and political affairs. The science part of the test contains questions 
... courses in biology, chemistry, and ... various sections reveal the character ... 


Verbal section 
sporadic: (A) immediate, (B) regular, (C) occasional, (D) alternate, 
(E) replete 

Quantitative part 
12. One-fifth of a batch of 2000 radio tubes were defective. If one- 
fourth of the first 1000 were defective, what fraction of the 
second 1000 were defective? 
(A) 1/20 (B) 1/10 (C) 3/20 (D) 9/40 (E) 3/10 

Understanding society 
... was the primar... North Atlantic Pa... 

Science 
21. A sodium atom and a sodium ion 
(A) contain the same number of electrons 
(B) contain the same number of protons 
(C) have the same chemical properties 
(D) have the same physical properties 
(E) have different atomic numbers 
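
The arithmetic demanded by the quantitative sample item (No. 12) can be checked with a few lines. The variable names are ours; the figures are those of the item, and exact fractions are used so that the result can be compared directly with the answer choices.

```python
from fractions import Fraction

batch = 2000
half = 1000

total_defective = Fraction(1, 5) * batch             # defective tubes in the whole batch
first_defective = Fraction(1, 4) * half              # defective tubes in the first 1000
second_defective = total_defective - first_defective # what remains for the second 1000
fraction_second = second_defective / half            # fraction of the second 1000

print(total_defective, first_defective, second_defective, fraction_second)
```

The item thus requires two steps, finding the number of defective tubes left for the second thousand and then expressing it as a fraction, which leads to choice (C).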


The first three parts of the battery are related directly to 
standing in medical school. The "understanding society" section 
is not related to medical knowledge, but is included in an 
attempt to select candidates for medicine who will be successful 
in adapting to the needs of the time. In their instructions to candi- 
dates, the authors write that “the test is intended to complement 
other data (your total college record, interviews, references and 
recommendations) with an objective inventory of your skills, 
concepts and information . . . acquired from formal study and 
from experience."* 


Law School Admission Test** 


Description. This battery is designed for use in selecting the 
best candidates from among those applying for law school. The 
battery has six parts: principles and cases, data interpretation, 
reading comprehension, debates (the examinee determines 
whether a statement supports, refutes, or is irrelevant to a given 
resolution), best arguments, and paragraph reading. Some of the 
material is difficult. The test battery has a correlation of about .50 
with law school grades. When combined with college marks, 
it is highly predictive of success in law school. 


Pre-Engineering Ability Test**

Description. This test consists of two sorts of material: (a)
comprehension of scientific materials, and (b) general mathematical
problems designed to measure competence in this area.
The first part of the test involves reading scientific prose, tables
and graphs, and answering questions based upon these materials.
The second part consists of problems in arithmetic, algebra, and
geometry. The Pre-Engineering Test correlates about .50 with
grades in the first term of engineering school. The reliability
of the battery is high.

* Medical College Admission Test, Bulletin of Information, Educational
Testing Service, Princeton, N. J., 1957, p. 22.
** Published by the Educational Testing Service, Princeton, N. J.


National Teacher Examination

Description. These examinations are planned for use by school
systems as an aid in the selection of teachers, and they are employed
also by teacher-training colleges [...] their students. The examinations
comprise the following sub-tests:

Professional information: Child development, educational psychology,
guidance, measurement, principles and methods of teaching.
General culture: [...]matics and on literature, [...] the development and [...]
[...]matical errors to be detected in sentences.
Non-verbal reasoning: A pattern completion test in which the examinee
must fathom the relationships in a given figure and choose the correct
figure to complete the pattern.

The optional examinations [...] education in elementary school [...]
sciences, English, industrial arts [...] and social studies. The four common
examinations have exhibited substantial relationships with ratings [...]
by supervisors. The tests do [...] factors, interest, or drive.


TESTS OF ARTISTIC APTITUDE OR TALENT


Tests in this area are concerned with finding whether an 
examinee possesses some of the factors which appear to be neces- 
sary for success in music or in art. So many traits contribute to 
the success of an artist or a musician that it is impossible for an 
aptitude test to do more than tap some of the more obvious com- 
ponents. Perhaps the best the aptitude tests can do in many in- 
stances is to aid the counselor in steering away from the arts those 
aspiring students who have no real talent and whose money and 
time might be better spent in other pursuits. 

It is doubtful whether the classroom teacher will have the 
time, the training, or the equipment needed to administer and 
interpret the aptitude tests in this area. Teachers engaged in
guidance should be familiar with such tests, however—with what 
they are and what they are trying to do. Two tests of music 
and one of art will be described in this section. They are repre- 
sentative of aptitude measures in this field. 


Seashore Measures of Musical Talents
Diagnostic Tests of Achievement in Music
Meier Art Judgment Test


Seashore Measures of Musical Talents*

Description. This is a test of "ear for music." The test battery
consists of six separate tests covering such attributes of tone as
pitch discrimination, loudness, rhythm, time, timbre, and tonal
memory. The tests are given by means of phonograph records.
Each test item or problem presents a pair of tones or a tonal
sequence. In the second playing, one of the tones is changed,
or the sequence of tones is altered in some way. In the pitch
discrimination test, the examinee marks on a test sheet whether
the second tone is higher (H) or lower (L) than the first.
Comparisons become progressively more difficult as the difference
in pitch between the two tones decreases. In the time and loudness
tests, the second, or comparison, tone differs in strength
or in pattern from the first. The rhythm test requires the examinee
to decide whether the second of two patterns is alike or
different from the first. The timbre and tonal memory tests
differ somewhat from the others. In the first, two tonal patterns
are compared for quality (consonance); in the second, a short
series of from three to five tones is played, and then played a
second time with one note changed. The subject must write
down the number of the altered note. The stimuli (tones) presented
by the phonograph records are as pure (uncomplicated)
as possible.

* Published by The Psychological Corporation, New York, N. Y.


Scope. The Seashore Tests are applicable from the fifth grade
on.


Scoring. Scores from the six sub-tests are plotted on a profile
to give a graphic representation of performance. Percentile norms
are available for fifth and sixth graders, seventh and eighth graders,
and adults. The Seashore Tests have been used in schools of
music and in music courses in academic schools. The tests admittedly
do not run the gamut of musical talent, but they do
measure important aspects of musical aptitude. A child who ranks
low on these tests has a poor ear for music and is a doubtful
selection for extensive musical training. The reliability of the
battery runs about .80.


Diagnostic Tests of Achievement in Music*


Description. As the name implies, this test battery is designed
to find how well students have acquired the theory and technical
knowledge needed to read and understand music. The
test consists of ten parts: diatonic syllable names, chromatic
syllable names, number names, time signatures, major and minor
keys, note and rest values, letter names, signs and symbols, key
names and song recognition. Test content is based on materials
recommended by musical authorities as fundamental in musical
education. A piano is required for the tests.

* Published by the California Test Bureau, Los Angeles, Calif.


Scope. For grades 4 through 12. Test items are graded up 
sharply in difficulty. 


Scoring. Norms for the test are based on the degree of mastery 
shown by students for the various sorts of material. Strengths and 
weaknesses are revealed by comparison of scores upon the ten 
parts of the test. The reliability of the whole test over the rather 
wide range for which it is applicable is very high. Reliability for 
separate grades is lower, but probably satisfactory. Working 
time for the test is about sixty minutes. The Diagnostic Tests 
are a useful supplement to “ear” tests like the Seashore. An ear
for music is necessary for any musical activity, whereas a knowledge
of the technical aspects of music is necessary for one
aspiring to be a musician.


Meier Art Judgment Test* 


Description. This test consists of a hundred problems in each of 
which an artistic judgment is demanded. Each test item is pre- 
sented in two versions. In the first version, there is a painting or 
drawing by some well-known artist, or an acknowledged artistic 
design; in the second, the same theme is presented but in altered 
form, the change being in symmetry, balance, unity, or rhythm. 
All pictures are in black and white, so that no complication is 
introduced (nor any clues) by color. The examinee is told that 
the two versions of the picture differ and is asked to select the 
better version. The test is, accordingly, a measure of aesthetic 
judgment, the criterion being the consensus of experts in art. 
See Figure 6-6 (facing page 150).

Scope. The Meier test is intended for junior and senior high
schools, as well as colleges and art schools.

Scoring. Norms are for students in art courses. High scores do
not necessarily mean that the student is destined to be an artist.
But a low score should be a warning signal to one planning a
career in art. The Meier test has correlations of from about [...]
to .50 with grades in art courses, but low correlation with scores
on verbal intelligence tests. This does not mean that artists are
unintelligent, but that many factors besides abstract ability must
enter into artistic appreciation. The reliability of the test is
about .75 for fairly homogeneous groups.

* Published by the Bureau of Educational Research and Service, University of
Iowa, Iowa City, Iowa.


HOW TO JUDGE AN APTITUDE TEST


Like tests of intelligence and of educational achievement, aptitude
tests must be judged by the adequacy of their validation,
reliability, scaling, and norms. Various comments concerning
these aspects of the tests described in this chapter have been made
in appropriate places. This and other material will now be
summarized.


Validity. Aptitude tests generally possess content validity. Tests
of speed, dexterity, seei[...] [...]cal problems, and [...] view toward
forecasting performance in school and (hopefully) in life.

Aptitude tests have been validated against various criteria, including
grades i[...] DAT.


Reliability. The reliability of the standard aptitude tests is
generally satisfactory, and we can have confidence in the stability
of a score. In some cases, the standard error of a score, as well
as the reliability coefficient, is given by the author of a test.
Scaling. In most aptitude tests, raw scores are converted into
percentile ranks. In some tests (the DAT, for example) scaled
scores are used on the profile in making comparisons of a given
student’s scores.
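
The conversion of raw scores into percentile ranks mentioned above can be sketched in a few lines of Python. This is only an illustration: it assumes the common midpoint convention (the percent of the norm group scoring below the raw score, plus half of those tied with it), and the function name and the ten-score norm group are hypothetical. The percentile tables in a test's own manual always take precedence.

```python
def percentile_rank(raw_score, norm_scores):
    """Percentile rank of raw_score within a norm group (midpoint convention).

    Assumption: ties count as half below and half above the score.
    """
    if not norm_scores:
        raise ValueError("norm_scores must not be empty")
    below = sum(1 for s in norm_scores if s < raw_score)
    tied = sum(1 for s in norm_scores if s == raw_score)
    return 100.0 * (below + 0.5 * tied) / len(norm_scores)


# Hypothetical norm group of ten raw scores (illustrative data only).
norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]
rank_of_20 = percentile_rank(20, norm_group)
```

With a real test, the norm group would be the standardization sample reported by the test's author, not a handful of classroom scores.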

Norms. All aptitude tests have norms either in percentiles or in
scaled scores. A few tests give norms for certain occupational
groups. One drawback to the use of vocational aptitude tests,
however, is the lack of adequate norms in many job areas. There
is a need for information regarding the predictive value of professional
and vocational tests for persons long out of school.
It would be a great advance if we knew how well an aptitude
test could forecast the success of engineers or lawyers, for example,
and not simply grades in courses.


SUGGESTIONS FOR LABORATORY WORK


1. Administer several standardized aptitude tests to the class. Cut the
time allowance if necessary. Students should score their own tests and
plot profiles when called for.

2. Find in the Manual the specifications which the author lays down
for his aptitude test. Examine the items of the test. Do you agree that
the test has content validity? Are any data given on experimental
validity?

3. Make a study of the Differential Aptitude Tests. Analyze the battery
for validity, reliability, scaling procedures, and norms.


QUESTIONS FOR DISCUSSION


1. How do aptitude tests differ from readiness tests in purpose and
content?

2. Why does a test battery like the Minnesota Clerical Test vary
greatly in the accuracy of its forecasts of office work? 

3. Why are aptitude tests used more often in high schools than in 
elementary schools? 

4. Give some reasons why paper-and-pencil tests of “mechanical
ability” are as useful as are work samples in determining aptitude.

5. How would you set up a program for selecting candidates for a 
nursing school? Outline the procedures you would use. 

6. Would you use the Bennett Mechanical Comprehension Test to 
select workers in an automobile factory? 

7. It has been said that the best measure of aptitude for mathematics 
(or for any subject) is the achievement to date. Do you agree? 

8. How could you discover whether the Meier Art Test is measuring 
native artistic ability and not training in art? 

9. A girl of 16 scores very high on the Seashore Music Test. Would 
you advise her to undertake a career in music? Why or why not? What 
else might you need to know about her? 

10. How can a follow-up study of graduates of law and medical schools 
be useful to a counselor using a professional aptitude test? 


CHAPTER 7


PERSONALITY TESTS


In previous chapters, we have indicated on several occasions
that prediction of success based upon measures of intelligence,
school achievement and aptitude must always be qualified by the 
statement “provided the personality traits are favorable.” In the 
present chapter, we shall attempt to see how well we can deter- 
mine favorable and unfavorable personality traits. 

There are a number of descriptions of personality, and the 
usefulness of any definition will depend in most cases upon the 
purposes of the author. For the teacher or school counselor, a 
practical working definition is to the effect that personality is a 
student’s characteristic way of doing things. Suppose that two
boys, John and Jim, are about the same age, have about the same
IQ, and do about the same caliber of school work. But suppose
further that John is friendly, highly motivated, and likeable; that
Jim is sullen, indifferent in attitude, and generally avoided by
teachers and classmates. The decided contrast in the behavior of
these two boys arises from their distinctive personality traits,
not from their differences in mental ability. Failure in school—
or in life—is often the result of a person's inability to make good
use of his potential personal assets. Obviously, lack of success
may arise either from negative (unpleasant) personality traits or
from failure to make use of positive (pleasant) personality traits.

The psychologist has attempted to evaluate personality traits
in three ways: (1) by rating scales, (2) by questionnaires and
inventories, and (3) by what are called "projective tests." The first
two approaches are the more readily applicable in the schools.
Projective tests should be employed only by clinical psychologists
and psychiatrists, since they require special training for
their administration and interpretation. These "tests" are essentially
clinical instruments and are most often used diagnostically in
cases of severe personality disorders or in behavior problems where
drastic emotional disturbances are suspected. Rating scales and
inventories, on the other hand, can be administered and interpreted
in a useful way by teachers and counselors.


RATING SCALES


The rating scale is a device for obtaining judgments of the
degree to which an individual possesses certain behavior traits and
attributes not readily detectable by objective tests. In the school
situation, rating scales provide appraisals of a teacher (or of a
candidate for a teaching position) in several characteristics. Ratings
by teachers or principals are often required for students
seeking entrance to college or looking for a job. In making a
rating, the judge expresses his opinion by marking along a graduated
scale or by checking in the category which he feels best
describes the person being rated.


FIGURE 7-1 Sample Items from Various Graphic Rating Scales


1. From a graphic rating scale for clerical workers:
Accuracy—Consider carefully quality of work, freedom from error.
(Scale line with descriptive anchors running from “no errors” to
“many errors.”)

2. From a behavior rating scale for children:
Is his attention sustained?
(5) Distracted: jumps rapidly from one thing to another.
(4) Difficult to keep at a task until completed.
(3) Attends adequately.
(2) Is absorbed in what he does.
(1) Able to hold attention for long periods.

3. From the American Council on Education Rating Scale for
prospective college students:
Does he get others to do what he wishes?
Probably unable to lead his fellows.
Lets others take lead.
Sometimes leads in minor affairs.
Sometimes leads in important affairs.
Displays marked ability to lead his fellows; makes things go.

4. From a rating device for teacher candidates:
(Put a check under the appropriate heading)
Tact: Very inferior | Inferior | Average | Superior | Very superior

5. From a rating scale for teachers:
(Circle the number which best indicates the degree or extent to
which the qualities are practiced.)
0 = unsatisfactory; 1 = below average; 2 = average;
3 = above average; 4 = superior.
Emotional maturity. To what extent does the teacher exhibit desirable
balance between emotional responsiveness and emotional
control? Consider disposition, sense of humor, restraint
and thoughtfulness in dealing with others, feelings of
security, objectivity of interest, freedom from excessive
fears and worries and warmth of feeling and expression. 0 1 2 3 4

6. From a rating scale for officer candidates:
[...] with fellow candidates:
Uncooperative | Grudgingly cooperative | Cooperates [...] |
Cooperates willingly | Leads and cooperates; contributes good ideas.


Perhaps the most useful type of rating device is the Graphic
Rating Scale or some variation of it. The typical graphic scale 
consists of a straight line, for example, five inches long, which is 
taken to represent the range of behavior in the trait. In lieu of a 
line, several categories representing gradations in the trait may 
be provided. The illustrations in Figure 7-1 are samples from 
various rating scales. 

Units on the graphic rating scale are represented by the suc- 
cessive scale divisions, but the directions make it clear that the 
check mark indicating the judgment may be placed anywhere 
along the scale line. A graphic scale is often scored by separating 
the scale line into one hundred points. A person’s rating is then 
determined by the distance of the judge’s check from the low 
end of the scale. A more summary method is also used: if there 
are five main divisions on the scale, the highest division may be 
designated “1,” the next division “2,” and so on down. 
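
The two scoring methods just described can be sketched in Python as a rough illustration. The five-inch line and the hundred-point scale come from the text; the function names, the rounding to whole points, and the rule that a check falling exactly on a boundary goes to the higher division are assumptions made only for this example.

```python
def graphic_scale_score(check_position, line_length=5.0, points=100):
    """Score a check mark by its distance from the low end of the line.

    check_position and line_length are in the same unit (inches here).
    """
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    if not 0 <= check_position <= line_length:
        raise ValueError("check must fall on the scale line")
    return round(check_position / line_length * points)


def division_number(check_position, line_length=5.0, divisions=5):
    """Summary method: the highest division is 1, the next 2, and so on down."""
    if not 0 <= check_position <= line_length:
        raise ValueError("check must fall on the scale line")
    width = line_length / divisions
    # Count divisions up from the low end; the top edge stays in the top division.
    index_from_low = min(int(check_position // width), divisions - 1)
    return divisions - index_from_low


score = graphic_scale_score(3.5)   # a check 3.5 inches from the low end
division = division_number(3.5)    # the same check on the summary method
```

Note that the two methods run in opposite directions: a high point score means a high rating, while on the summary method the best rating is the smallest number.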


Requirements of a Good Rating Scale


A good rating scale should satisfy [...] Traits should be carefully [...]
loyalty, courage, unselfishness) [...] observed in social behavior. [...]
intimate details about a teacher—whether he sings in the choir,
loves his mother, or plays golf well. Information of this sort is 
often called for, however. Judges do not often have occasion to 
observe or to learn about personal behavior and, in general, 
should not be asked to supply such information. 

The number of divisions on the scale should be neither too 
numerous nor too few. The optimum number of divisions on a 
graphic scale is perhaps five to seven. Fewer than five divisions
make the groupings too coarse; more than seven divisions
demand fractionings of the trait which are too fine for most
raters, with the result that a large part of the scale may be unused.
A five-division scale is popular, since it corresponds to the mark- 
ing system A, B, C, D, and E. Furthermore, the five categories 
“high,” “above average,” “average,” “below average,” and 
“poor” seem to mark off fairly natural divisions. 

Directions to the rater should be explicit. The adequacy of the
directions given the rater will have a substantial effect on the 
validity and reliability of his ratings. The rater (1) should be 
given as explicit directions as possible, (2) should be told what 
is meant by the distribution of a trait, and (3) should be warned 
against assigning too many “average” ratings. This last is some- 
times needed when the persons to be rated are not well known 
to the rater, when the meaning of the traits is not well under- 
stood, and when the rater is overcautious. Raters must be warned 
against the “halo effect” and the tendency to see logical relations 
among traits—to assume, for example, that intelligence and moral 
behavior or intelligence and good work habits of necessity go 
together. 

As for the distribution of traits, the best first hypothesis (in lieu
of other information) is to assume that ratings will be distributed
in the form of a normal curve. When the baseline of the normal
curve is subdivided into five equal parts, the percentages in the
divisions (reading from either end of the curve) are 7, 24, 38,
24, and 7. If there are seven divisions on the scale, the percentages
in the subdivisions are 4, 10, 22, 28, 22, 10, and 4. The directions
should make it clear that the exact proportions in the normal
curve should not be followed slavishly. But the point should be 
stressed that when a group of teachers is rated for “skill in instruc- 
tion” on a five-division scale, we must expect many more in the 
middle of the scale than at either end. 
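
The percentages quoted above for five and seven divisions can be checked with a short Python sketch. It rests on an assumption the text does not state: that the baseline is taken as five standard deviations long (2.5 on each side of the mean) and that the small tails beyond it are counted in the two end divisions. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def division_percentages(divisions, half_range_sigma=2.5):
    """Percent of cases in each of `divisions` equal parts of the baseline.

    Assumption: the baseline runs from -half_range_sigma to +half_range_sigma,
    and the tails beyond it are folded into the two end divisions.
    """
    width = 2.0 * half_range_sigma / divisions
    cuts = [-half_range_sigma + i * width for i in range(1, divisions)]
    areas = [0.0] + [normal_cdf(c) for c in cuts] + [1.0]
    return [100.0 * (areas[i + 1] - areas[i]) for i in range(divisions)]


five = [round(p) for p in division_percentages(5)]
seven = [round(p) for p in division_percentages(7)]
```

Under that assumption the rounded results agree with the figures in the text, which suggests (but does not prove) that this is how they were obtained. A rater would use them only as a rough guide, as the directions themselves caution.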

The tendency to assign too many "average" ratings (the “cau- 
tion factor”) is to be contrasted with the tendency to assign too 
many high ratings (the “generosity factor”). Stress on the 
normal distribution of most traits will help correct this tendency. 
If the low end of the rating scale is described by such unpleasant 
terms as “stingy,” “stupid,” “mean,” this part of the rating line 
may be avoided by many raters. The “halo effect” mentioned 
above is the tendency to rate a person high on all traits if he is
well liked or is regarded as highly intelligent. Conversely, if a
person is disliked, there is a tendency for him to be rated low on
all traits. To minimize halo, the rater is usually told to rate all
candidates on a single trait, then to rate all on a second trait, and
so on. This procedure is impractical, of course, when the rater
is called on to consider one person at a time and rate him on, say,
ten traits. We are often forced, therefore, to resort to warnings
against halo and careful definition of traits.

Validity and Reliability of a Rating Scale

Ratings for intelli[...]

Consensus of judges. In ratin[...] reliability mean virtually [...]
agree that Brown is a [...] supervisor had so stated. In general,
confidence increases with the
number of agreements when ratings are made independently. 

It can be shown that if the estimates of two judges correlate
.60, then the average of these ratings will correlate .75 with those
of two equally good judges. It is, of course, difficult to decide 
when two judges are “equally good.” We can never guarantee 
this to be true, but we can (a) select judges who at least know 
the ratees well, (b) provide careful definitions of the traits to be 
rated, and (c) allow for individual differences in rating standards. 
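
The figure of .75 quoted above agrees with the Spearman-Brown formula for the reliability of pooled ratings, and a small Python sketch shows the arithmetic. That the text's value was obtained this way is an assumption, and the function name is hypothetical; the formula itself assumes the judges are "equally good" in the sense discussed above.

```python
def pooled_rating_reliability(r_single, n_judges):
    """Spearman-Brown estimate for the average rating of n equally good judges.

    r_single is the correlation between the ratings of any two single judges.
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    if not 0.0 <= r_single <= 1.0:
        raise ValueError("r_single must lie between 0 and 1")
    return n_judges * r_single / (1.0 + (n_judges - 1) * r_single)


two_judges = pooled_rating_reliability(0.60, 2)    # the case given in the text
four_judges = pooled_rating_reliability(0.60, 4)   # pooling more judges raises the value
```

The same function illustrates the earlier remark that confidence increases with the number of independent agreements: each added judge raises the estimate, though by less each time.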


Summary on Rating Scales 
Ratings from graphic scales will generally deserve confidence 
when: 

1. Qualities which can be observed in behavior are rated. 
Energy, appearance, and teaching skill are better rated than 
are character and moral traits. 

2. Characteristics to be rated are illustrated. The use of behavior- 
grams (page 160) and instances will strengthen the ratings. 

3. Raters have actually observed the persons to be rated in 
situations where personality might be revealed. 


4. Independent ratings are pooled.

5. Judges are confident that the ratings are valuable.

6. Different standards are accounted for by explicit directions
or by statistical techniques.

The above rules are perhaps most useful when one has the
problem of constructing a rating device; and they may not seem
to be very helpful to the teacher who is faced by a ready-made
scale. Teachers and supervisors rarely have the responsibility for
devising a rating scale. But teachers are rated by supervisors and
supervisors by principals. Moreover, students are rated by
teachers and by principals for personality traits judged important
by colleges or prospective employers. Hence, the teacher should
be familiar with how the rating scale is put together and how
it works. Raters can improve a rating device by offering criticisms
and commendations. Eventually comments of this sort
should lead to a better scale.
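
Rule 6 of the summary mentions statistical techniques for handling different rating standards without naming one. One simple possibility, offered here only as an assumed illustration and not as the method the text has in mind, is to convert each judge's ratings to standard scores before pooling them, so that a lenient judge and a severe judge count equally. The function names and the sample ratings are hypothetical, and higher numbers are taken to mean better ratings.

```python
from statistics import mean, pstdev


def standardize(ratings):
    """Convert one judge's ratings to standard (z) scores."""
    m = mean(ratings)
    s = pstdev(ratings)
    if s == 0:
        # A judge who gives everyone the same rating conveys no ranking.
        return [0.0 for _ in ratings]
    return [(x - m) / s for x in ratings]


def pool_ratings(ratings_by_judge):
    """Average the standardized ratings across judges, ratee by ratee.

    ratings_by_judge maps a judge's name to a list of ratings, with the
    ratees in the same order for every judge.
    """
    standardized = [standardize(r) for r in ratings_by_judge.values()]
    n_ratees = len(standardized[0])
    if any(len(z) != n_ratees for z in standardized):
        raise ValueError("every judge must rate the same ratees")
    return [mean(z[i] for z in standardized) for i in range(n_ratees)]


# Hypothetical ratings of four teachers by a lenient and a severe judge.
ratings = {"lenient": [4, 5, 5, 3], "severe": [2, 3, 3, 1]}
pooled = pool_ratings(ratings)
```

In this example the two judges rank the four teachers identically and differ only in severity, so standardizing removes the difference entirely before the ratings are averaged.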


QUESTIONNAIRES AND PERSONALITY
INVENTORIES


The behavior inventory calls for short answers to a standard
set of questions believed to br[...] the Inventory or [...] report
rather tha[...] self-[...]justment—is revealed by a [...] fears, feelings
of insecurity or of [...] confidence and the like. The [...] canv[...]
the examinee’s feelings, opinions, attitudes about various institutions
(for example, the church [...]) [...] (for example, war, freedom of
speech and internationalism). Finally [...] preferences for occupations
[...] as physics or history [...] sports, hobbies, and [...]

[...] may take the direct or the [...] approach. In the first, the [...]
high places (direct) or [...] whether he would rat[...] pilot (indirect).
In the [...] form [...] assumption made is that an [...] is less likely
[...] his answers when he [...] what motive or wha[...] trait the
inventory is trying to [...]


The Personality Inventory


Personality inventories were used by the armed forces in 
World Wars I and II to screen out the maladjusted and those 
likely to become mentally ill. These personal data (or PD) 
sheets consisted of lists of symptoms reported by men who sub- 
sequently had suffered from “nervous breakdown” or were classi- 
fied as psychoneurotic. Adult questionnaires have been revised 
by deleting the more serious and disturbing items so that they 
could be used in the schools. Questions which were removed 
deal with the more reprehensible forms of adult behavior such 
as those involving liquor and sex offenses. In the schools, the 
questionnaire is used to locate pupils with potentially handi- 
capping personality problems. The acceptability of a PD Sheet 
to pupils, parents and the community is necessary if the inventory
is to be used generally as a group test. A teacher will be well
advised to make sure that the inventory he proposes to use has
the approval of the school authorities. It is important that the
reading level demanded by an inventory be carefully scrutinized, 
since many items may not be understood. 

The personality inventory is most valuable in the schools for
counseling and guidance—that is, for spotting pupils with existing
or potential personality difficulties. When used individually
and in face-to-face contacts, the PD Sheet is more flexible and
becomes essentially a directed interview. Answers given by the
student can be pursued further until their meaning is clear. This
cannot be done, of course, when the inventory is administered
in group form. Of the personality inventories available (many
cover the same ground), the following represent acceptable
“tests” for use in the schools:

California Test of Personality
Pintner's Aspects of Personality
Gordon’s Personal Profile and Personal Inventory
Bell’s Adjustment Inventory
Thurstone Temperament Schedule

Each of these questionnaires will be considered in this section.


California Test of Personality*


Description. This test series runs the gamut from the elementary
grades to adulthood. Each battery is divided into two sections
designed to measure (1) personal adjustment and (2) social adjustment.
The six sub-tests in section 1 are designed to bring out
how a student thinks and feels about himself, his feelings of confidence
and adequacy, his tendencies to withdraw within himself
and to exhibit nervous symptoms. In section 2, the six sub-tests
question the examinee on his knowledge of social standards, his
social skills, his freedom from anti-social attitudes, and his relations
to family, friends, and the community. The questions are
Yes-No in form. Figure 7-2 gives samples from the test.


Scope. There are five separate test batteries:
Primary Series, kindergarten to grade 3
Elementary Series, grades 4-8
Intermediate Series, grades 7-10
Secondary Series, grades 9-college
Adult Series
Over-all time for administering a test battery is approximately 50
minutes.


FIGURE 7-2 Sample Items from the California Test of Personality,
Elementary, Grades 4-5-6-7-8, Form AA

PERSONAL ADJUSTMENT (Circle YES or NO)

10. Do your parents or teachers usually need to tell you to
do your work? YES NO
23. Do people often think that you cannot do things well? YES NO
25. Do you feel that your folks boss you too much? YES NO
38. Are you proud of your school? YES NO
50. Would you rather stay away from most parties? YES NO
68. Do you often feel tired before noon? YES NO

SOCIAL ADJUSTMENT

[...] nasty to them? YES NO
114. Do you like both of your parents about the same? YES NO
123. Is it fun to do nice things for some of the other
boys or girls? YES NO
139. Do you try to get friends to obey the law? YES NO

Reproduced by permission of the California Test Bureau


* Published by the California Test Bureau, Los Angeles, Calif.

Scoring. Answers can be recorded in the test booklet itself or
on a prepared answer sheet. Scoring is objective and easy. A
profile of the different scores and over-all adjustment score can
be constructed. The pupil's earned score (point score) is entered
opposite the personality component and the percentile rank corresponding
to this score is found in the appropriate table. Percentile
ranks for total personal adjustment and for total social
adjustment may also be entered.

The reliability of the five batteries is quite high (.80-.94).
Percentile norms are provided for each sub-test and for the 
battery as a whole. This inventory is a useful indication of a 
pupil's all-around adjustment. Diagnosis from the sub-tests is sug- 
gestive rather than conclusive; but many valuable clues which 
serve to explain a child's behavior may be obtained. 


Aspects of Personality (Pintner)* 

Description. This inventory consists of three parts: ascendance- 
submission, extroversion-introversion, and emotionality. It was 
designed to aid the classroom teacher in locating children who 
have developed—or are likely to develop—serious behavior prob- 
lems. Samples of the kind of items found in the test are as follows:

I have a lot of nerve Same-Different
I like to read before the class Same-Different
I feel tired most of the time Same-Different
When a child tries to push into line ahead of me, I
am not afraid to tell him to get back Same-Different


The pupil indicates his agreement or disagreement with a state- 
ment by marking or circling “same” or “different.” 


* Published by the World Book Company, Yonkers, N. Y.

Scope. Aspects of Personality is intended for use in the elementary
grades and junior high school.


Scoring. This inventory is readily scored by means of a stencil.
There are separate norms for boys and girls, and for two levels 
of maturity. A low score on the ascendance-submission part in- 
dicates a shy child, a high score an aggressive one. High scores on 
extroversion-introversion suggest good adjustment; low scores 
suggest withdrawing tendencies and daydreaming. Low scores 
on emotional stability suggest flightiness and lack of control. The
total score is a rough index of personal adjustment, and probably
only wide deviations should be investigated. The test often provides
useful clues.


Personal Profile and Personal Inventory (Gordon)* 


Description. The Personal Profile is designed to measure four 
fairly distinct personality traits: (a) ascendancy, (b) responsi- 
bility (perseverance or reliability), (c) emotional stability, and 
(d) sociability. The examinee is asked to indicate which of four 
statements (there are eighteen sets) is most descriptive of himself
and which is least descriptive. A specimen set is 

Able to make important decisions without help 

Does not mix easily with new people 

Inclined to be tense or high strung 

Sees a job through despite difficulties 
Each of these phrases is descriptive—positively or negatively— 
of one of the four traits included in the inventory.

This personality questionnaire uses what has been called the
"forced-choice" technique; that is, the examinee is instructed to
choose between statements two of which appear to be equally
acceptable and two equally unacceptable. (See the description of
the indirect questionnaire on page 164.) This method of presenting
items has certain advantages. If the two choices are fairly
well equated for social value, it is hard for the examinee to fake 


* Published by the World Book Company, Yonkers, N. Y. 
his answer, since he does not know clearly what is behind either
choice. Again, the use of forced-choices reduces hesitation and 
indecision, since the examinee is required to make a decision, 
rather than simply choosing between Yes and No. If the examinee 
likes none of the choices, he may select the least objectionable. 
In addition to the four trait scores, the Personal Profile yields a 
total score, which may be represented graphically along with 
the four part scores. Very low total scores have been found to 
be associated with maladjustment and poorly developed
personality.

The Personal Inventory also covers four traits: caution, original 
thinking, personal relations, and vigor. The total score depicts 
the student’s personal development in these areas. 


Scope. The Profile and Inventory are designed for high schools, 
colleges, and adults. 


Scoring. Percentile norms are available for each scale, for boys
and girls separately, for high school, and for college. The four 
scores and the total may be represented graphically on a profile. 
These questionnaires have considerable validity, as is shown in 
follow-up studies. Together, the two inventories are useful in
counseling students and in screening out those with potential 
behavior problems. Reliabilities of the sub-tests and of the total
are satisfactory.


Adjustment Inventory (Bell)* 


Description. This well-known inventory consists of questions
to be answered Yes, No, or ?. It has been designed to estimate
personal adjustment in four areas: home (satisfactions and
dissatisfactions), health (illness and general well-being), social
relations (shyness, aggressiveness, and so on), and emotional
behavior (self-confidence, depression, and the like). Samples of
the kinds of items in the questionnaire are


* Published by the Stanford University Press, Stanford, Calif. 
Are you troubled with shyness?      Yes  No  ?
Do you daydream frequently?         Yes  No  ?
Are you often low in spirits?       Yes  No  ?

The Bell inventory has proved useful chiefly in locating students
who need counseling. It provides valuable leads to social and
personal maladjustment.


Scope. The student form of the Bell is for high school and
college students. There is a form for adults which may be used in
vocational counseling.


Scoring. The inventory is not timed, but ordinarily requires
about twenty-five to thirty minutes. The over-all reliability is
high. There are percentile norms for high-school and college
students and for men and women.


Thurstone Temperament Schedule* 


Description. This inventory consists of a set of 140 questions, 
twenty items being grouped under each of seven aspects of 
temperament or emotional expression. Adjectives describing the 
seven temperamental traits are: active (degree of energy), 
vigorous (participation in physical activities, sports), impulsive
(happy-go-lucky), dominant (aggressive, forthright), stable 
(emotionally), sociable, and reflective (thoughtful, meditative). 
These seven behavior areas, which were identified through a 
study of the intercorrelations of many personality variables, are 
believed to constitute certain basic aspects of social behavior. The 
inventory is well adapted for use with normal people: items 
obviously bearing upon mental disease have been avoided. 


Scope. For high schools, colleges, and adults. 


Scoring. Percentile norms are available for the sub-sections, 
and the seven scores may be plotted on a profile for a study of 
idiosyncrasies. The reliability of the whole inventory is high. 
However, the reliabilities of the sub-sections are not high. Hence,
although the inventory may provide valuable clues to a counselor,
diagnosis based on part scores should be tentative.


* Published by Science Research Associates, Chicago, Ill.


ATTITUDE SCALES 


When we know that a man is a Socialist or a Christian Scientist, 
we feel fairly sure that we can predict his answers to questions 
dealing with politics or religion. An attitude is a consistent point 
of view, a way of behaving toward an institution, a social group, 
or toward personal, political, or religious issues or practices. 
Attitudes may be fairly narrow or quite broad, and they may be 
strongly or weakly held. In general, an attitude pivots around 
strong likes or dislikes. A person's attitude toward drinking, 
professional sports, popular music, or “eggheads,” for example, 
will be exhibited in expressions of opinion which are often 
emotional.

Scales for measuring the spread and strength of attitudes have 
often been used by social psychologists but are rarely employed 
routinely in the schools. One of the most comprehensive lists of 
attitude scales (about thirty in all) has been constructed by 
L. L. Thurstone and his associates at the University of Chicago.
These scales estimate the strength of one’s attitude (on either
the favorable or unfavorable side) toward such diverse matters
as war, capital punishment, the church, and communism.

In this section we shall describe two scales both of which 
have been useful in high school and college. These are 
Ascendance-Submission Reaction Study (Allport) 
Study of Values (Allport-Vernon-Lindzey) 


Ascendance-Submission Reaction Study* 

Description. This questionnaire attempts to determine whether 
a person characteristically dominates or is dominated in the
face-to-face contacts of everyday life. The A-S Reaction Study is
usually classified as a personality inventory, but it can just as 


* Published by the Houghton Mifflin Co., Boston, Mass. 
well (perhaps better) be described as an attitude scale, since it
tries to discover an individual’s habitual way of behaving in 
everyday social contacts. There are two forms of the test, one 
for men and one for women. Each item presents a situation which 
might readily be encountered in school, on the street, or in a 
store or bus. From two to four possible responses are offered. 
The examinee selects that option which most nearly represents 
what he would ordinarily do. Choices range from aggressive to 
submissive and are weighted in such a way as to differentiate be- 
tween the two attitudes. Scoring weights for the separate items 
were determined experimentally, and the total score shows the
strength of the examinee’s typical behavior on a
dominance-submission scale.


Scope. The A-S inventory is designed for use in high schools, 
in colleges, and for adults. 


Scoring. Scoring is by stencil, the answers being weighted + 
(plus) for dominance and − (minus) for submissiveness. Percentile
norms are provided for high-school and for college students,
and for adult men and adult women. The A-S Study is
often useful in educational and vocational guidance. In many
occupations, such as nursing, teaching, library work, and clerical
jobs, a strongly dominant attitude is a liability rather than an
asset. On the other hand, in positions requiring leadership,
dominant behavior and self-confidence are crucial when decisions
are to be made. The A-S Study is especially valuable when combined
with aptitude and other tests. The test has satisfactory
reliability.


Study of Values (Allport-Vernon-Lindzey)* 

Description. This questionnaire sets out to gauge the strength 
of six basic attitudes, described as follows: theoretical (marked 
by dominant interest in the discovery of truth, the rational
approach to life); economic (interests lie in practical affairs); 


* Published by the Houghton Mifflin Co., Boston, Mass. 
aesthetic (places greatest value on form and beauty); social (chief
interest in people); political (primary interest in power, influence,
and renown); and religious (committed to mystical values, seeks
to comprehend the universe). The Study assumes that a person’s 
philosophy of life is revealed by the strength of his basic 
attitudes. See Figure 7-3. 


Scope. College students and adults. 


Scoring. To score, one simply adds the weights assigned the
various items. The total score for each of the six Values can be
plotted on a profile to show graphically the relative strengths of
the individual's attitudes. Norms are for college students, for
men and women separately, and for some occupational groups.
The Values inventory has shown expected differences between
medical and theological students and characteristic differences
among other occupational groups. The inventory is useful in
counseling and in personnel selection. It is also valuable in
forecasting the direction of a student's attitudes.


FIGURE 7-3 Specimen Items from a Study of Values 


(Answers are indicated by checking or marking.) 


Theoretical v. economic:

1. The main object of scientific research should be the discovery of truth
rather than its practical applications. (a) Yes; (b) No.

Religious v. social:

9. Which of these character traits do you consider the more desirable:
(a) high ideals and reverence; (b) unselfishness and sympathy?

Aesthetic v. theoretical v. political v. economic:

10. Which of the following would you prefer to do during part of your
next summer vacation (if your ability and other conditions permit):
a. write and publish an original biological essay or article
b. stay in some secluded part of the country where you
can appreciate fine scenery
c. enter a local tennis or other athletic tournament
d. get experience in some new line of business


From Allport-Vernon-Lindzey, “Study of Values.” Reproduced by permission of Houghton
Mifflin Company.
INTEREST INVENTORIES


The interest inventory is essentially a self-report or survey 
covering a person’s own interests, values, preferences, and feel- 
ings over a wide range of activities. The importance of interests 
early became apparent in industry when studies of worker 
efficiency made it clear that job success depends as much on
motivation as on aptitude and training. In the school, knowledge of a
student’s dominant interests is of real significance for the
counselor or teacher. The printed interest inventory supplies
systematic information about a student’s attitudes, feelings, and
personality trends which otherwise could be revealed only in a 
long interview, if at all. Students often report patterns of interest 
which differ widely from their stated educational and vocational 
goals. The astute counselor may be able from study of the in- 
ventory to suggest occupational areas which the student hitherto 
had not even thought of. 

The interest inventory is generally more acceptable in the 
school than is the attitude scale or the personal data (adjustment) 
blank. Many students resent the personal questions of the adjust- 
ment blank and sometimes find them emotionally disturbing. For 
this reason, they conceal their true feelings, especially if they are
not conventional or socially acceptable. An interest inventory is 
not likely to be faked or responded to adversely. Examinees find
it impersonal, less prying, and often interesting in itself. Hence
their appraisals are usually honest. From an interest inventory,
a counselor gets a clearer idea of a student’s occupational
aspirations. He may get, too, valuable clues as to a student’s
personality: for example, his desire for security rather than for
adventure, for active rather than passive roles, for people rather
than books.

Most interest inventories are vocational and were
designed for adults. As a result, they are not very useful below
the eighth grade. This is not a serious disadvantage, however,
since the interests of elementary children are often uncrystallized
and may be superficial, unreliable, and unrealistic. The moving
pictures, TV, and romantic stories invest certain activities (that
of the actor, the game hunter, space adventurer, and detective, 
to cite a few) with artificial glamor. Moreover, a pupil’s information
concerning many occupations (time required, aptitudes
needed, financial returns to be expected) is often meager and
distorted. Even in high school and college, when choice of voca- 
tion becomes crucial, information on many occupations is not 
available. A number of pamphlets describing the requirements 
for various occupations will be helpful to counselors and students 
(see page 253). 

This section will describe three interest inventories, one suitable
for pupils whose reading level is up to sixth-grade standards,
and two for high-school and college students and for adults.


Occupational Interest Inventory
Kuder Preference Record
Strong Vocational Interest Blank


Occupational Interest Inventory* 


Description. This inventory provides scores in three interest
areas. First, there are scores in six basic fields of occupational
interest: personal-social (personal contacts, service fields);
natural (outdoor activities, farming); mechanical (machinery
design, building, constructing things); business (activities of the
business world, the “profit motive”); arts (music, literature,
drama); sciences (chemistry, engineering, biology). Second, certain
items are designated verbal, manipulative, or computational,
and scores in these areas provide information as to the direction
of one’s interests. Finally, the attempt is made to gauge the level
of an examinee’s interests: whether his interests identify him
with simple routine aspects of a job or with the more expert
performances and skills.


* Published by the California Test Bureau, Los Angeles, Calif.


The six basic fields of occupational interest show considerable
overlap, and their identity as strictly separate compartments of
interest is doubtful. At the same time, the Occupational Inventory
does give the counselor a better notion of the intensity
and range of a person’s dominant interests than could be obtained 
from casual conversation, and it furnishes many clues to type and 
level of commitment. Vocational or other advice based on in- 
terest scores should always be tentative, however, and subject to 
confirmation from other sources. The items of the Inventory 
are forced-choice in form. The Manual gives many suggestions as 
to the ways in which results may be interpreted. A few samples 
from the intermediate form of the Inventory will show the kinds 
of question asked. The large letter before a question gives the 
interest field, the small letter the interest level, and the symbol the 
interest type (whether verbal, manipulative or computational). 


Part I

Directions: Draw a circle around the letter of the activity you
prefer. For example, if you prefer to drive an ice-cream truck
and sell ice cream, draw a circle around A 1 as shown below:

(A) 1 Drive an ice-cream truck and sell ice cream.
 E  1 Wrap articles in the shipping department of a store.

However, if you prefer the second activity, draw a circle around
E 1. A second item, to be marked according to the same directions,
is

A 14 Conduct visitors through art galleries and museums
E 14 Help build automobiles, ships, or airplanes

Part II

Directions: Below you will find three activities under each
number. You are to choose the one you prefer to do
of the three in each group. Indicate your choice by
marking the letter preceding the activity.

# 11 Design or construct stained glass, metal ornaments or
     plastic figures
b 11 Make pottery, statues or book ends
d 11 Carve wood or stone or make metal ornamental figures 


Scope. The Intermediate Inventory begins with grade 7 but 
may be used with bright sixth graders. The Advanced Inventory
is for grade 9 and for adults.


Scoring. Percentile ranks for the six basic interest fields may 
be read from the appropriate tables. Standard scores from con- 
verted raw scores are also provided. Part scores may be repre- 
sented graphically by means of profiles for a clearer comparison 
of interest-strength. The working time for the test is thirty to 
forty minutes. Scoring is simple: the items designated by letters
and symbols are counted.


Kuder Preference Record* 
Description. The Kuder Preference Record (Form BI) is a
widely used vocational interest blank. There are 360 items in all,
arranged in groups of three. In each set of three, the examinee is
asked to indicate which activity he would like most and which he
would like least. Response is made by punching a small hole with
a pin, and the answer is recorded on a specially prepared answer
sheet placed under the test blank. The samples below (given as
examples in the Record) are for illustration:

Directions: You will find a number of activities listed in groups
of three on the following pages. Read over the three
activities in each group. Decide which of the three
activities you like most. Note the letter in front of
it and punch a hole beneath the 1 beside this letter
in the column at the right, using the pin with which
you have been provided. Then decide which activity
you like least and punch a hole beneath the 3
beside the corresponding letter in the column at the
right.


* Published by Science Research Associates, Chicago, Ill. There are 3 forms,
2 vocational and 1 personal.
Example #1
                                      1   3
P. Visit an art gallery               O P O
Q. Browse in a library                O Q ●  < least
R. Visit a museum            most >   ● R O

(The punch in the hole beneath 1 beside R shows that the examinee
would most like to visit a museum. The punch in the hole
beneath 3 beside Q means that he would least like to browse in
a library.)

Example #2
                                      1   3
S. Collect autographs                 O S O
T. Collect coins             most >   ● T O
U. Collect butterflies                O U ●  < least

(The punch beneath 1 beside T means that the examinee would
most like to collect coins. The punch beneath 3 beside U means
that he would least like to collect butterflies.)
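
The response format illustrated in the two examples can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the publisher's scoring procedure: the function name `record_triad` and its return format are hypothetical. It records one group of three activities and insists on exactly one "most" choice and one different "least" choice, which is what makes the item forced-choice.

```python
def record_triad(options, most, least):
    """Return the punch pattern for one group of three activities.

    Each letter maps to a pair (punched under 1, punched under 3).
    The name and return format are hypothetical, chosen for illustration.
    """
    if len(options) != 3 or len(set(options)) != 3:
        raise ValueError("a group must contain three different activities")
    if most not in options or least not in options:
        raise ValueError("choices must come from the group")
    if most == least:
        raise ValueError("the 'most' and 'least' choices must differ")
    return {letter: (letter == most, letter == least) for letter in options}


# Example #1 from the text: museum (R) liked most, library (Q) liked least.
pattern = record_triad(["P", "Q", "R"], most="R", least="Q")
print(pattern)
```

The activity left unpunched (P in this case) is the one the examinee ranks between the other two, so each group of three yields a complete ordering from only two punches.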


All items are of the forced-choice type in that the examinee
has to make [...] scientific, [...] and clerical, [...]
over several broad areas rather than in fairly specific occupations.
There are, for example, fifty-three occupations grouped under 
scientific interests. Items in the Record were first organized on 
a logical basis in the light of everyday experience and common 
sense. Later, items were analyzed statistically in order to isolate 
clusters of items highly correlated. These clusters were taken to 
reveal a core of interest.

The Kuder Record relies for its validity primarily on content 
analysis and logical relations. The number of choices offered 
and their nature sometimes confuse students; and the inability 
to find clear-cut preferences may lead to dissatisfaction with the 
forced-choice aspect of the test. Below the eighth grade the 
reading level is probably too high, and the Record should not 
be used. The fact that the scoring plan does not weight sharply 
strong vs. weak interests has been another criticism of the test. 
At the same time, the Kuder Record is an excellent measure of 
the range of expressed interests, and as such is valuable in edu- 
cational and vocational guidance. It is often possible to point 
out to a student that he has expressed many interests not in line 
with his vocational goals. The median reliability coefficient for
the nine interest-areas is .91.


Strong Vocational Interest Blank* 

Description. This was the first vocational interest blank and is 
still the best-known. There are forms for men and women. The 
Blank has gone through several revisions and in its latest form 
comprises four hundred items grouped under eight categories. 
These are occupations (likes and dislikes), school subjects, 
amusements, outdoor and indoor activities, responses to peculiari- 
ties of people, choice of activities, comparison of interests, and 
evaluation of personal abilities. The examinee indicates his choices 
by circling or marking. Answers to the items are given numerical 
weights, obtained by comparing the replies of a defined occu- 
pational group (lawyers, for example) with the replies of people 


* Published by the Stanford University Press, Stanford, Calif. 
in general. In all, forty-five occupations or areas of interest are
covered by the Blank. 


Scope. For men and women. 


Scoring. A person’s score on a given scale (his interest in teaching,
for example) is found by totaling the plus (+) and minus
(−) credits obtained from the options he has marked. A separate
key is used for each vocation. Thus an examinee’s blank may be 
scored for the interests of an engineer, a physician, and a sales 
manager. Point scores are converted into standard scores to
afford a direct comparison. A useful scale of letter grades is
also available: A represents close identification of interests with
the given vocation, B+, B, and B− somewhat lesser agreement,
and C+ and C a very different interest pattern from that of the
occupation under study. For example, a college student may have
A and B scores in the interests of a minister and social worker, and
C+ and C scores in the interests of a mathematician or physicist.
Somewhat less time-consuming, and often more valuable than
specific vocational scores, are the scales for interest clusters. There
are eleven of these clusters, for example, [...]
interest in social science, interest in [...]
(... or school superintendent); [...]
(... engineer), and business-[...]
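
The keyed scoring described above for the Strong Blank (totaling plus and minus credits, with a separate key for each vocation) can be sketched as follows. This is a minimal illustration under stated assumptions: the item numbers, the response letters (L, I, D for like, indifferent, dislike), and every weight are invented, and the real keys are supplied by the publisher.

```python
def score_blank(marked, key):
    """Total the plus and minus credits one key assigns to the marked options."""
    return sum(key.get(item, {}).get(option, 0) for item, option in marked.items())


# Invented keys: item number -> response letter -> credit. Not real weights.
KEYS = {
    "engineer": {
        1: {"L": 3, "I": 0, "D": -2},
        2: {"L": -1, "I": 0, "D": 2},
        3: {"L": 2, "I": 1, "D": -3},
    },
    "physician": {
        1: {"L": -1, "I": 0, "D": 1},
        2: {"L": 3, "I": 1, "D": -2},
        3: {"L": 0, "I": 0, "D": 0},
    },
}

# One examinee's marked options on three hypothetical items.
marked = {1: "L", 2: "D", 3: "L"}

# The same blank is scored once per key, giving one point score per vocation.
point_scores = {vocation: score_blank(marked, key) for vocation, key in KEYS.items()}
print(point_scores)
```

The point of the design is that a single set of answers is re-scored against each occupational key in turn. The resulting point scores would then be converted to standard scores or letter grades from the publisher's tables, which this sketch does not attempt.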


SUMMARY ON PERSONALITY INVENTORIES


Validity. Insofar as an inventory includes questions which
experts agree are relevant to the area being tested, the
questionnaire has content validity. The adjustment inventories (PD
Sheets) are made up of items drawn from texts on abnormal psy- 
chology and cover conditions which have been found to be 
symptomatic of mental illness. The interest inventories have been 
validated experimentally against a number of criteria: expressed 
interests of successful professional and businessmen, successful 
completion of training courses, ratings for work success, staying 
in an occupation vs. leaving it, and degree of job satisfaction. 
Correlational analysis has been used to locate clusters of items 
which embrace a common core of interest or a community of 
interest patterns. Follow-up studies of the Strong Vocational 
Interest Blank show that men tend to stay in occupations for 
which they expressed strong interests as students and to change 
occupations for which their expressed interests were weak. 

In using interest inventories, several precautions should be 
taken. It is well to remember that interest and aptitude are not 
the same thing, and that many youngsters express interest in 
vocations for which they have little capacity. Again, the interests 
of young people—especially those below the age of 25—often 
change markedly. Adolescents may express unrealistic interests 
which change drastically later on. More than one determination 
of interests, therefore, should be made. Finally, it must be 
remembered that advice about occupational families is much
safer than advice about specific jobs. Any inventory, personality 
or interest, should be supplemented by school and intelligence 
records, ratings for health, appearance, motivation and
socio-economic status.


Reliability. The reliability coefficients of most inventories are
high— .80 or more. As interests change over a period of time, 
reliability determinations can be relied on for short periods only. 


Scaling. Inventories are usually scored by assigning weights to 
the various options presented. These points are converted into 
percentile norms, standard scores, and sometimes letter grades. 
Norms for the adjustment inventories are most often for stu- 
dents, less often for occupational groups. The interest inventories 
report norms for occupational families (Kuder) and for specific
occupations (Strong). Test Manuals provide many useful sug- 
gestions for the interpretation of the inventories. 
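
The conversion of point scores into standard scores mentioned under Scaling can be illustrated with a short sketch. It assumes a scale with a mean of 50 and a standard deviation of 10, a common convention that the text itself does not specify, and the eight point scores are invented for the example.

```python
from statistics import mean, pstdev


def standard_scores(points, scale_mean=50.0, scale_sd=10.0):
    """Convert point scores to standard scores on an assumed 50/10 scale."""
    m = mean(points)
    sd = pstdev(points)  # standard deviation of the whole group
    if sd == 0:
        raise ValueError("all scores are equal; standard scores are undefined")
    return [scale_mean + scale_sd * (p - m) / sd for p in points]


# Invented point scores for eight examinees (mean 5, standard deviation 2).
POINTS = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_scores(POINTS))
```

Because every inventory is brought to the same mean and spread, a standard score of 60 means the same relative standing on one scale as on another, which is what permits the direct comparison of part scores on a profile.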


SUGGESTIONS FOR FURTHER READING 


Anastasi, A. Psychological Testing. New York: Macmillan, 1954. 


Freeman, F. S. Theory and Practice of Psychological Testing (Rev. 
edition). New York: Holt, 1955. 


Jordan, A. M. Measurement in Education. New York: McGraw-Hill, 
1953. 


Travers, R. M. Educational Measurement. New York: Macmillan, 1955.


SUGGESTIONS FOR LABORATORY WORK 


1. One of the best ways to become acquainted with a personality inventory
is to take it yourself. Members of the class should take as many of
the questionnaires as are available, score them, and draw up profiles 
where called for. 


2. Examine the Manual for the Kuder Preference Record.
What is said about validity, reliability,


QUESTIONS FOR DISCUSSION 


1. In an adjustment inventory, the number of positive symptoms becomes
the score. What is meant by saying that Stanley is on the median
for a personal data sheet?

2. Which interest blank, the Kuder or the Strong, is more appropriate
for high-school students?

3. How might an interest inventory be used in studying a child's per- 
sonality trends? 


4. How closely related are interests and aptitude? Does the relation- 
ship change with age? 
5. Why are personal data blanks of little value when administered as
group tests?

6. Under what circumstances do you think the interest inventory
would be most helpful? At what age levels? Give reasons for your
answers.
7. What factors limit the usefulness of paper-and-pencil adjustment
inventories? 

8. A high-school senior expresses a strong interest in engineering, but
his interest inventory score does not confirm this interest. What would 
you as counselor suggest to him? 

9. The Strong Vocational Interest Blank has a key for various specific
occupational interests—dentist, banker, carpenter, for example. What 
difficulties do you see in such restricted interest patterns? 

10. Why is a personal data sheet easier to fake than an interest blank, 
even when the items are not forced-choice? 


CHAPTER 8 


OBJECTIVE-TEST ITEMS AND
SHORT-ANSWER TECHNIQUES 


There are at least two reasons why the teacher interested in 
guidance should be familiar with the main types of objective test 
items. In the first place, the most widely used standard group
tests are made up of items of the objective sort. (See Chapters 
4 and 5.) Hence, a knowledge of the strengths and weaknesses 
of objective questions will enable a teacher to make a more dis- 
criminating choice among several tests of intelligence, educa- 
tional achievement, or aptitude proposed for use in a school. 
Second, a teacher-made test is greatly improved when the teacher 
knows the principles which govern the writing of objective type 
items and the assembling of them into a test. 
This chapter will describe some of the better-known (and
more widely used) verbal objective type items. These include
the true-false, multiple-choice (best-answer), matching, comple- 
tion, and short-answer essay questions. The advantages and dis- 
advantages of each of these item types are listed and examples 
given to illustrate errors to be avoided in writing items of each 
type. Objective tests employ numbers, geometric forms, pictures, 
and diagrams, as well as words. (Figure 8-1 provides a number of 
illustrations.) Some of the varieties of non-verbal items frequently 
encountered in standard tests fall under the following heads: 


1. Number Series Completion.* The examinee is asked to com- 
plete a series of numbers—which are related in some way— 
by the addition of one or more appropriate numbers. 

2. Figure Completion. The examinee must complete a figure by
the addition of a line or other detail.

3. Likenesses and Differences. From a list of pictures showing
objects or activities, the examinee is required to select several
which belong together, or to select an item which does not
belong with the others.

4. Picture Completion. The examinee is to complete a picture
from which one or more items have been omitted.

5. Errors in a Drawing. The examinee must locate and correct
errors in a drawing.

6. Arranging Pictures. The examinee is to arrange a set of pictures
in orderly fashion so as to tell a story.


Most non-verbal items are variations on the multiple-choice 
type. Non-verbal items are frequently used in tests designed for 
young children. (See page 119.) 

Comparison of Objective Items and Essay Questions 


The traditional essay question often covers too much ground, 
and is open to large errors in scoring and interpretation. Con- 
sider the question “Discuss the causes of the War of 1812” as 


* This test is also classified as verbal. 


FIGURE 8-1 Objective-Test Items Represented by Picture, Drawing, or Diagram


[Figure: specimen picture, drawing, and diagram items, each with its printed directions.]
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an example of a common form of essay question. Answers to аг 
question will almost certainly include material that is true an 
relevant, material that is ambiguous, material that is clearly 
erroneous, and: material that is mostly padding. It becomes well- 
nigh impossible for two or more readers to evaluate the answers 
to such a question in the same way. However, when choices in 
objective test questions are recorded by checking one of several 
possible answers, circling а number, or underlining a word or 
phrase, the grade on the test will be the same whether the scoring 
is done by a clerk or by an expert. And the answer will be right 
or wrong. 

Examinations composed of objective items possess several other
advantages over questions of the essay type. The objective item
not only eliminates unreliability due to personal opinion but is
more easily scored, is economical of time, and allows for a
wider sampling of material. Furthermore, the objective test item
forces the student to answer a question directly, gives him little
opportunity to equivocate or dodge, and is, for that reason, a
more dependable measure of what a student knows. On the negative
side, the objective item may provide little opportunity for
the examinee to display his understanding and organizing ability.
When poorly made, the objective item may lay too much stress
on rote memory and unrelated bits of information.


Defining the Purpose of the Test Item 
It is necessary to decide what a test item is intended to do.
An item may:

1. Elicit information (often fairly specific) which reveals an
understanding of a process, principle, situation, or historical
movement.

2. Require the examinee to demonstrate knowledge and use of
technical terms and concepts.

3. Give the examinee a chance to show his ability to apply a
principle in the solution of a problem, draw a conclusion,
arrive at a generalization.

4. Call forth responses which will reveal the examinee’s
attitudes, interests and personality traits.


Not every item, of course, can be fitted neatly into one of
these categories. Some (many, we hope) will cut across several.
Nevertheless, each item should be written to achieve a definite
purpose, to call out some important bit of knowledge,
understanding or application.


Assembling Test Items

In the process of making an objective test, the type of item
to be used must be decided upon and the items written, before
we are ready to assemble them into tentative form and try out
the test. Several problems arise: determining the difficulty of
the items and their discriminative power, drawing up directions,
and preparing a key and scoring sheets. Methods for carrying
out these procedures will be treated in Chapter 9.


TRUE-FALSE ITEMS 


The true-false test presents a series of statements or questions
each of which is to be marked “T” (true) or “F” (false). Instead
of circling one of the letters “T” or “F,” the examinee may be
asked to circle “Yes” or “No,” or to write + (plus) or − (minus),
or in some other way to designate a positive or negative answer.
One of the earliest objective forms, the T-F test is still widely
used in group intelligence as well as in educational achievement
and aptitude tests. It has been criticized as being a measure of
rote memory, a test of detached and unrelated facts, and as often
being ambiguous and equivocal. Such strictures are justified when
the test is poorly or carelessly made. There is a large element of
guessing in T-F tests, too, and good items are not easy to
construct, however simple the process may seem to be. But when
well made, T-F items have valuable possibilities arising from
their scope and flexibility. The chief advantages and
disadvantages of the T-F item may be summarized as follows:


Advantages:

1. It may be used with a wide variety of materials.

2. It may be scored easily and objectively.

3. It is the easiest objective type to construct.

4. It makes possible an extensive sampling of material in a
relatively short space.

5. It is a time-saver, thus allowing for frequent testing.

6. The directions are readily understood and followed.


Disadvantages:

1. It is often ambiguous and confusing.

2. It is open to guessing and to chance effects.

3. Much subject matter cannot be stated as unequivocally true
or false.

4. It may readily become a test of detached and unrelated bits of
information.

5. It may overstress rote memory at the expense of
understanding.


Some of the rules useful in constructing teacher-made tests are
given below. In judging the adequacy of printed T-F items, it
will help to note whether these rules have been observed.


1. Have the examinee circle or mark a letter printed before each
question rather than write the letters at the end of a statement,
thus scattering his answers over the page. Circling or marking
saves time in scoring and reduces errors, since the letters
written by examinees are not always legible.

On the test paper:          On the answer sheet:
(T) F   1. ........          1. (T) F
T (F)   2. ........          2. T (F)

2. Make the number of true statements equal to the number 
of false statements. The scoring formula for T-F items is 


Score = Right − Wrong
or Score = Total − 2 × Wrong


Either of these formulas corrects for guessing, and both give
the same result provided the pupil has tried all of the items.
Suppose for example that there are sixty items in the test, and a
pupil gets forty right and twenty wrong. Then his score is 40 − 20
or 20, or 60 − 2 × 20 or 20. If the child does not try all of the
items, the two versions of the formula will not give the same
result and the first (R − W) should be used.

If an examinee guesses at every item, he should have one-half
of the items right and one-half wrong, and his score (R − W) is
properly zero. If an examinee attempts only thirty out of forty
items in a given examination, his score may be corrected to a
total of 40 by adding one-half of the untried items, that is, half
of 10, to his number right. (Presumably he would get one-half
of the untried items right by guessing.) It is not necessary to
correct every paper to the number of items in the test. But test
scores for a class are the more fairly compared when all are
based upon the total number of items in the test.
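
The arithmetic of the two T-F scoring formulas, and of the correction to the total number of items, can be checked with a short Python sketch. The function names are illustrative and are not taken from the text; the figures of 22 right and 8 wrong for the pupil who tries thirty of forty items are assumed purely for the sake of the example.

```python
def tf_score(right: int, wrong: int) -> int:
    """Score = Right - Wrong (the form to use when items were omitted)."""
    return right - wrong


def tf_score_from_total(total: int, wrong: int) -> int:
    """Score = Total - 2 x Wrong (agrees with R - W only if every item was tried)."""
    return total - 2 * wrong


def corrected_number_right(right: int, attempted: int, total: int) -> float:
    """Correct number right to the full test by crediting half of the untried items."""
    untried = total - attempted
    return right + untried / 2


# The example in the text: sixty items, forty right, twenty wrong.
print(tf_score(40, 20))             # 20
print(tf_score_from_total(60, 20))  # 20

# Assumed figures: thirty of forty items tried, 22 right and 8 wrong.
print(tf_score(22, 8))                     # 14
print(tf_score_from_total(40, 8))          # 24, no longer the same result
print(corrected_number_right(22, 30, 40))  # 27.0
```

When items have been omitted the two formulas part company, which is why the sketch treats R − W as the form to rely on in that case.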

The correlation between number right and (R − W) is perfect
when all of the items of the test have been tried. Hence,
when a child’s score has been corrected to the total, number
right may be taken as the score instead of (R − W). The question
of whether to tell an examinee to guess has excited much
controversy, partly because of the opprobrium attached to the
term guessing as related to school examinations. A good general
rule is to instruct the examinee to omit only those items which he
is sure he doesn't know; to answer even when not entirely
certain of the answer, but never to guess wildly. Since the
examinee has been exposed, at least, to the subject matter of the
test, the chances are better than even that his answer will be
based on some information, even if it is vague and uncertain.
Hence, a T-F answer is more likely to be right than wrong.


3. Avoid opinionated and trivial (or trick) items. 


Examples: T F Character is more important than intelligence. 
T F The ABC Test of Mental Maturity contains 75 
items arranged into 6 sub-tests. 
T F William Collins Bryant is the author of Thana- 
topsis. 
T F One-half of a perfect correlation is .50.


The first of these items calls for a value judgment, which may
be true or false; the second and third ask for trivial information;
and the fourth is a trick question which happens to be false.


4. Avoid ambiguous statements, those partly true and partly
false, and those containing negatives, especially double negatives.

Examples: T F Socio-economic factors are often the cause of
war.
T F William Jennings Bryan, the great Commoner,
was twice elected president of the U.S.
T F Not every teacher is careful to avoid having a
student dislike his subject.
T F Not all instincts are maladaptive.

The first item is ambiguous; the second is partly true and partly
false; and the third and fourth are confusing because of the
negative form in which they are stated. Double negatives are
especially hard to decipher.


5. Avoid textbook language and verbatim quotations. Such
items encourage rote memory and are often ambiguous when
taken out of context.

Examples: T F The role of the teacher is to help the pupil
establish satisfying goals.
T F Heredity determines what a man can do,
environment what he does do.

Textbook verbiage aids in making a correct guess.


6. Avoid specific determiners, such as all, none, always, never,
every. Broad generalizations introduced by these words are
usually false.

Examples: T F Feeblemindedness is always present in
delinquency.
T F Corporal punishment is never justified.
T F All ministers lead lazy lives.

These items are all too general and all are incorrect.

The T-F item is not so popular among teachers as it was 
formerly, and it is not found so often in standard tests. It is still 
ranked high, however, and is perhaps the quickest way of sur- 
veying a wide range of material. When supplemented by other 
test forms, T-F is a valuable objective item.


MULTIPLE-CHOICE OR BEST-ANSWER ITEMS 


The multiple-choice item consists of a statement, question,
phrase, or word followed by several responses only one of which
is correct. Multiple-choice is one of the most flexible of the
objective-recognition-type forms. It is a favorite with teachers
when making their own examinations, and is most widely
employed in the standard printed forms. Multiple-choice items can
be so constructed as to measure information, comprehension,
understanding of principles, and ability to interpret data. The
test form is applicable to most subjects and to most materials.
Some of the strengths and weaknesses of the multiple-choice
item can be summarized as follows:

Advantages

1. Answers are objective and are rapidly scored.

2. Items may be written to measure inference, discrimination,
and judgment.

3. Guessing is minimized when four or five choices are allowed.

4. Items may be constructed to measure recall as well as
recognition.

Disadvantages

1. Items are often too factual, stressing memory unduly.

2. More than one response may be correct or very nearly correct.

3. It is difficult to exclude clues.

4. Distractors (that is, incorrect but plausible answers) are
often hard to find.


Rules for constructing multiple-choice items for teacher-made
tests and for judging the adequacy of such items in printed tests
are as follows:


1. Vary the position of the correct response: put the right
answer in the first, second, third, fourth positions equally often.
A scoring formula for multiple-choice items is

Score = Right − Wrong/(n − 1)

in which n = the number of choices, usually four or five. This
formula is used to correct for guessing on the assumption that
each response is equally likely. This conjecture is correct when
the examinee has no idea of the right answer; but in educational
achievement examinations as well as in other tests, it is a
questionable hypothesis. Distractors differ greatly in plausibility
and likelihood; and since the student presumably has some
knowledge of the question, he is more likely to mark the right
than a wrong answer. In most educational achievement tests,
taking the number right as score saves time and is accurate enough
for most purposes. It must be remembered that in a given test the
number of options must be the same for each item if the above
correction formula is to be used.
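
The correction formula for multiple-choice items can be worked through in a few lines of Python. This is only a sketch: the function name and the figures in the examples are assumed for illustration and do not come from the text.

```python
def mc_score(right: int, wrong: int, n_choices: int) -> float:
    """Score = Right - Wrong / (n - 1), with n the number of choices per item.

    Every item in the test is assumed to offer the same number of choices.
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Assumed figures: five-choice items, 38 right and 12 wrong.
print(mc_score(38, 12, 5))   # 35.0

# Blind guessing on 100 four-choice items: about 25 right and 75 wrong.
print(mc_score(25, 75, 4))   # 0.0

# With two choices the formula reduces to Right - Wrong, as for T-F items.
print(mc_score(40, 20, 2))   # 20.0
```

The second example shows the sense in which the formula corrects for guessing: under the equal-likelihood assumption, a paper marked wholly by chance comes out at zero.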


2. Do not include responses which are so unlikely or implausi- 
ble or so unrelated to the question as to give the answer away. 


Distracting responses should distract, not confuse. 
Examples: The function of a flower is to
______ give pleasure to mankind
______ attract insects
______ illustrate the modification of leaves
______ produce seed

The capital of the United States is
______ Washington
______ Rome
______ Tokyo
______ London
______ Honolulu

The principal crop of Iowa is
______ pineapples
______ corn
______ oranges
______ bananas

In the first example, assuming the fourth choice to be the cor- 
rect one, the distractors are all rather silly. In examples two and 
three, an examinee would have to be almost totally ignorant of 
geography to be taken in by the distractors. 

3. Do not provide wrong answers which are plausible enough
to mislead the good student because they are close to the right
answer. The good student is often led astray by knowing a good
deal (but not quite enough) about a question, whereas the poor
student does not know enough to be misled by a plausible but
wrong answer.

Example: What was one of the important immediate results of
the War of 1812?
______ the introduction of a period of intense sectionalism
______ destruction of the U.S. bank
______ ... of the Jeffersonian party
______ collapse of the Federalist party

The fourth response is keyed as the correct one. But 39 per cent
of eight hundred high-school pupils, all of them superior
students, checked the first option as correct. Apparently the first
answer is plausible to students who know a good deal about the
War of 1812.


4. Do not give away the correct answer by providing clues 
such as (a) familiar textbook phrases, (b) having the right 
option consistently longer or shorter than the wrong options, 
(c) repeating the words of the question, (d) asking questions 
to which the answer must be singular or plural, with only the 
correct response being in the right number. 


Examples: In what major labor group have unions been organized
on an industrial basis? (Circle one letter.)
A. Congress of Industrial Organizations
B. Railway Brotherhoods
C. American Federation of Labor
D. Knights of Labor
E. Workers of the World

The meaning of the German word Gestalten is (Check one)
______ a response
______ a just-noticeable-difference
______ a stimulus
______ configurations
______ a perception

A man hears a loud noise and runs to the window.
This is an example of
______ motivation
______ memory image
______ stimulus-response
______ posthypnotic suggestion
______ purposive behavior

In the first of these examples, the adjective “industrial” in the
question gives the answer away. In the second, if the student
knows that Gestalten is the plural of the German word Gestalt,
he has the answer as “configurations.” In the third, the textbook
phrase “stimulus-response” is a clear clue.


5. In a multiple-choice vocabulary test, none of the response
choices should be as difficult as the test word. The difficulty of
response words can be determined from their frequency in
Thorndike’s Teachers’ Word Book. Response words should be
of the same part of speech as the test word, and only one should
be correct.

Good example: An irksome task is (a) pleasant, (b) engrossing,
(c) instructive, (d) wearisome

Poor example: Do not despise him means do not (a) hate,
(b) malign, (c) deprecate, (d) desiccate him

In the second example, some of the response words are more
difficult than the test word. This is not true of the first example.


6. Direct questions or statements followed by a series of
options are usually clearer than questions in which the answers
are imbedded in the statement. In the latter form, the examinee
must read through the statement for each option.

Good example: A 10-year-old receives a percentile rank of 40
on a test of arithmetic. This means that
______ he is above the mean of 10-year-olds on the test.
______ he exceeds 60 per cent of 10-year-olds.
______ 40 per cent of 10-year-olds did worse than he.
______ 61 per cent of 10-year-olds exceeded his score.

Poor example: Percentile rank shows the per cent (a) at or
above, (b) above, (c) at, (d) below, (e) at or below the given
score.

The second example is more difficult to decipher than the first.

A test made up of multiple-choice items takes more time to
construct than a test of T-F items. Furthermore, good
multiple-choice items are harder to prepare than T-F items, since
it is often difficult to find acceptable distractors. The advantage
of the T-F item is largely offset, however, by the fact that
multiple-choice items are more searching and demand a more
organized knowledge of the subject-matter. Multiple choice is
regarded by most test experts as the best of the short-answer forms.


MULTIPLE-RESPONSE ITEMS 


The multiple-response item is a variation of the multiple-choice
type of question. Essentially it presents a statement or
topic followed by a number of possible answers, several of which
may be checked as correct. The multiple-response examination
tests several aspects of a subject, and is useful in obtaining
information from tables and charts. This examination form is often
called a check list. The advantages and disadvantages, as well as
rules for construction given above for multiple-choice, apply to
multiple-response items as well.

Two examples follow:

Example: Under each of the following psychological doctrines,
viewpoints, or systems, indicate by a cross (x) those
implications or consequences which are characteristic
of that doctrine.

1. English Associationism
______ persisting self
______ summation and integration of mental states
______ universal categories of reason
______ mental faculties
______ persisting motor-response systems

2. Purposive Psychology (McDougall)
______ imageless thought
______ introspection as the primary method
______ S-R units
______ motivation in terms of instincts
______ doctrine of the unconscious
______ conative tendencies

Each of these items can be described by more than one choice.


MATCHING ITEMS 


In the matching test, one list of words, names, phrases, for- 
mulas, or statements is to be matched against another list. The 
test may consist of (a) a list of names in one column to be 
matched against a list of achievements in a second column; 
(b) a list of terms to be matched against a list of definitions; 
(c) labels to be matched against charts and diagrams; (d) authors 
to be matched against books, dates and events. 

The matching item possesses the advantages of interest and 
variety as well as ease of scoring. It is, furthermore, somewhat 
easier to construct than the multiple-choice item. Matching has 
been frequently used to test the relationship between dates, events
and various facts. On the negative side, the matching item often 
measures recognition memory rather than understanding, and is 
especially open to clues. Nor do matching items ordinarily test
ability to organize facts or to apply principles.

Rules for making up matching test items may be set down as
follows:

1. Do not include too many items in the lists: 10 or 12 is the
maximum, 5 or 7 often enough. When lists are long, examinees
must spend too much time hunting through them. Have the
number of items in the column from which selections are to be
made larger than the number in the list to be matched. This
lessens the chances that an examinee will match an item correctly
by a process of elimination.

Example: The following statements are representative of
different schools of psychology. In the blank spaces before
the statements, write the number of the psychologist
for whom the statement is typical.

(1) Adler       (7) McDougall
(2) Angell      (8) James Mill
(3) Calkins     (9) Pavlov
(4) Freud       (10) Titchener
(5) Jung        (11) Watson
(6) Koehler     (12) Woodworth

______ Sensory processes have the attribute of clearness,
just as they have quality and intensity.

______ There is evidence for the existence of three
types of native and unlearned emotional reactions:
fear, rage and love.

______ The inadequacy, the relative futility, of all
attempts to ignore the purposive, the goal-seeking
nature of behavior renders behaviorism untenable.

______ Any mechanism, except perhaps some of the
most rudimentary that give the simple reflexes,
once it is aroused, is capable of furnishing its
own drive and also lending drive to other
connecting mechanisms.

______ The will-to-power is the great motive in mental
conflict.

______ The superego represents the repressions of
instinct and dominates the ego.

______ Mind is primarily engaged in mediating between
environment and the needs of the organism.

______ Sensations are one of the primary states of
consciousness; ideas are the other.

2. Select materials from one subject-field only, so that a given
item in column 1 has several plausible matches in column 2.
Explain clearly the basis of the matching.

Example: In column 1 are words which illustrate a number of
parts of speech; in column 2 is a list of various parts
of speech. Determine what part of speech a word is and
then identify it by putting its number before the
proper item in column 2. For example, “boy” is a
noun, and if “boy” were numbered 5, a 5 would be
placed before the word “noun” in column 2. Arrange
the choices in alphabetical order.

(1) and         ______ adjective
(2) cat         ______ adverb
(3) rapidly     ______ noun
(4) jump        ______ preposition
(5) from        ______ verb
(6) rich
(7) either

Example: Match the items in column 1 with the appropriate
items in column 2.

A. Harvey       ______ contagious disease
B. stomach      ______ digests food
C. poison       ______ discovered circulation of the blood
D. Galen        ______ early Greek physician
E. lungs        ______ supplies oxygen to the blood
F. heart
G. measles

The first example is quite easy. But it should enable a teacher to
spot grammatical confusions. All of the material is from the field
of grammar. The second item is poor owing to heterogeneity in
the list of choices (names and bodily organs).

3. Arrange names in alphabetical order, dates and numbers in
sequence in order to save the examinee’s time.

Example: Select the inventor from the first list and put his
number opposite his invention in the second list.

(1) Colt        ______ Atlantic cable
(2) Edison      ______ cotton gin
(3) Field       ______ electric starter
(4) Franklin    ______ sewing machine
(5) Howe        ______ steam engine
(6) Kettering   ______ wireless telegraphy
(7) Marconi
(8) Watt
(9) Whitney


4. Avoid clues, for instance, one singular item in both lists,
the others plural; one item in the list of a different part of speech
from the others. Watch for irrelevant (but revealing)
associations, such as nationality, which give away the matching.
For example, if the examinee knows that a certain discovery was
made by a Frenchman, he will look for a French name.

The matching item is compact and usually interesting to
students. It enables a teacher to cover a wide territory in fairly
short time. Matching is well suited to rapid surveys of specific
aspects of a field when persons, events, or definitions are wanted,
or when these constitute necessary knowledge for further work
in the subject.


COMPLETION ITEMS 


In this test form, sentences are presented from which certain 
words or phrases have been omitted. Instructions are to fill in 
the blanks so as to complete the meaning. Completion requires 
recall primarily, but it also demands thought and the ability 
to perceive over-all relationships. Little opportunity is afforded
for guessing. The chief disadvantage of this test form lies in the 
scoring, which is not entirely objective and is often time-consum- 
ing, and in the fact that too many blanks confuse the examinee 
and make a puzzle out of the test. Completion has been a favorite 
of teachers in their own examination-making, although it is not 
so widely used today as multiple-choice and T-F items. 

Rules for writing completion items and errors to be avoided 
in such items are as follows: 


1. Do not copy sentences and paragraphs directly from the
textbook, since this puts too much emphasis on rote memory and
parrot-like learning. Rephrase the language of the text, if that is
used.

Example: Human behavior, more than that of any other animal,
is a product of ..........

Example: Much learning is by trial and ..........

The first is a poor item. It is out of the textbook and will be
known by those who recall the textbook language. The second
is also a poor item: the pat expression “trial-and-error” gives it
away.

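2. Do not leave so many blanks that it is impossible for the
examinee to complete the item, especially when the sentence is
short.

Example: Civilized man ..........

This item actually appeared on a test. It is impossible to
complete it, or else it can be completed in a variety of ways,
most of them not indicative of much knowledge.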

3. Scoring is more objective if words rather than phrases are
deleted. Blank out key words (those which carry the meaning of
the sentence or paragraph), not unnecessary elements or the
articles a, an, the.


Examples: Democracy is that form of .......... in which the
.......... rule through .......... elected by themselves.

Democracy is that form of government in which ..........
people rule .......... representatives elected ..........
themselves.

The first form of the item is the better, since the blanks contain
key words. The second version deletes connecting words which
do not carry the meaning of the sentence.

Example: .......... established the first laboratory
for the .......... study of psychology in ..........

To complete this item the examinee must know who established
the first laboratory in experimental psychology and when it was
founded.


4. Make the blanks long enough to permit legible answers. 
Have all blanks of standard length to avoid clues as to the length 
of the completing word. 


5. When there are several correct answers, provide for these
in a scoring key. Alternate answers may be weighted for goodness
of completion, but a simple right or wrong scoring is adequate
in most cases. A good plan is to allow one point for each
correct answer, none for an incorrect one.


6. Guard against clues by taking care that completions do
not depend upon (a) grammatical form, (b) pat or textbook
expressions.

Examples: Johnny wears his space suit, even when he .........
to bed.

A much discussed question is the relative importance
of heredity and ..........

In the first item, the first singular verb is a clue to the number of
the second verb. The second item tests rote memory, and the
“pat” expression “heredity and environment” gives it away.


THE ESSAY QUESTION 

The essay question has been a standby of teachers over the
years. It is widely used in the ... and in the natural sciences. The
purpose of the essay question is to test organization and
interpretation, rather than factual knowledge. The form of the
essay question is important. Questions beginning with “who,”
“what,” “when,” and “where” are
usually to be avoided when they ask simply for a name (for
example, Napoleon), a date (1492), an event (Battle of Hastings)
or a location (New York City). But such questions are valuable
when the information asked for is relevant to the solution of a
problem, the making of an inference, or the interpretation of
some event. Questions beginning with “why,” “how,” “with
what consequences,” or “with what significance” are to be
preferred to simple fact questions. Questions beginning with such
words as “discuss,” “evaluate,” “outline” and “explain” invite
(and usually get) a mass of detail, some not relevant. Such
questions are useful, of course, when we wish to know how well
an advanced student can select, reject and organize. But they
are hard to score and are virtually useless in a broad survey or
for the diagnosis of specific blindspots.


Restricting the Essay Question 


The essay question becomes objective when cast into short-answer
form and restricted in coverage. Two methods of controlling
the essay question and rendering it more specific may be
mentioned.

Recall Questions. Recall items are essay questions reduced to
the simplest terms. Usually a question is followed by a blank
space varying in length. Answers are restricted to short
paragraphs, the account of some event, an algebraic equation and
its application and the like. Recall items resemble the completion
type; but they provide for fairly free answers and are less
restricted.

Examples: (1) ...
(2) Name the ... atomic theory, ...tion of each.
(3) List three conditions which must hold true if an
intelligence test is to yield a constant IQ.

The first item calls for a one-line answer. Items (2) and (3) ask
for specific but basic knowledge. Compare (3) with the essay
question, “Discuss the construction of the Stanford-Binet.”

Problem Situations. A problem is stated, and 2, 3, or 4 specific
questions are asked, each focused on some important aspect of
the situation.

Example: A skillful teacher has been characterized as one who
(a) maintains a permissive atmosphere.
(b) avoids negative discipline.
(c) conforms to the wishes of parents.
(d) does not use repetitive drill.
Write one paragraph defending or attacking each of
these propositions, that is, four paragraphs in all.

Example: A recurring problem in child development is that of
maturation. Cite the evidence bearing on the problem
from the following points of view:
(a) neurological
(b) co-twin control
(c) parallel groups

Scoring the Essay Question Objectively


Perhaps the major weakness of the essay examination lies in
the unreliability of its scoring. Scoring can be made more
objective by the use of the following techniques:

1. When essay examinations are marked anonymously, there
is usually better agreement between different scorers.

2. There is less opportunity for preferences, attitudes, and
biases to appear when all papers are read for one question at a
time rather than each paper straight through. Obviously,
comparisons can be sharper with this method.


3. Before reading a question, the teacher can list the basic facts 
which the question is intended to bring out. Points may then 
be assigned to these aspects of the answer. For example, if the 
question deals with a chemical process, the answer list may in- 
clude (a) the necessary equations of the process, (b) the chemical 
elements needed, (c) a diagram of the apparatus, and (d) any 
by-products of the chemical reaction. If the question deals with 
English literature, the answer may include (a) the author’s chief 
contribution, (b) the cultural setting of the time, (c) the influ- 
ence of the author’s work. A check list of key points, with credits 
assigned to each, is a useful technique. Thus, from one to three
points may be assigned to each part-answer.


4. If the teacher marks the papers for spelling, writing quality,
and grammatical expression, as well as for content and organiza- 
tion, credits should be allotted to these aspects of the answer. 
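
The check-list technique in points 3 and 4 can be sketched in a few lines of Python. This is only an illustration: the key points are the ones named in the chemistry example above, but the point values and the names `KeyPoint` and `score_answer` are assumptions made for the sketch, not part of the text.

```python
from dataclasses import dataclass


@dataclass
class KeyPoint:
    label: str
    max_points: int  # the text suggests one to three points per part-answer


def score_answer(checklist, awarded):
    """Add up the credits given for each key point, capped at its maximum."""
    total = 0
    for point in checklist:
        given = awarded.get(point.label, 0)  # a missing key point earns nothing
        total += max(0, min(given, point.max_points))
    return total


# Hypothetical point values for the chemical-process question.
chemistry_checklist = [
    KeyPoint("equations of the process", 3),
    KeyPoint("chemical elements needed", 2),
    KeyPoint("diagram of the apparatus", 3),
    KeyPoint("by-products of the reaction", 2),
]

# Credits one reader assigned to one (hypothetical) paper.
awarded = {
    "equations of the process": 3,
    "chemical elements needed": 1,
    "diagram of the apparatus": 2,
}

print(score_answer(chemistry_checklist, awarded))  # 6
```

Because every reader works from the same list and the same caps, two scorers are more likely to arrive at the same total. Credits for spelling or grammatical expression, if wanted, can be added as further entries in the list.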

The essay question is a valuable examination form when held 
to one or more defined themes, so that it is scorable. Many 
teachers are so impressed by the general use of objective-type 
items in the standard tests that they are inclined to drop the 
essay entirely. This is a mistake. Many courses, especially ad- 
vanced courses, in literature and in science employ objective-test 
items as a first approach to an examination of the subject. But 
the essay question is the best (perhaps the only) way in which 
a teacher can determine whether a student can organize his 
knowledge and arrange his arguments in logical fashion. Short-answer
forms should be regarded not as substitutes for the essay,
but rather as supplementary to it.


SUGGESTIONS FOR FURTHER READING 


Gerberich, J. R. Specimen Objective Test Items: A Guide to Achievement ...
... York: Prentice-Hall, 1954.


Travers, R. M. How to Make Achievement Tests. New York: Odyssey 
Press, 1950. 

Wrightstone, J. W., Justman, J., Robbins, I. Evaluation in Modern
Education. New York: American Book, 1956. 


Note: Most textbooks on educational psychology and in measurement 
and evaluation contain chapters dealing with objective items. 


QUESTIONS AND PROBLEMS 


1. Write five true-false items in some subject field familiar to you. 

2. Rewrite the five items in number 1 in multiple-choice form. 

3. Construct five matching items in which great scientists, poets, and
authors are matched against outstanding contributions to science or
literature.

4. If possible, put the items in number 2 in completion form.

5. Rewrite the following essay questions to make them more objective
in answering and in scoring.

a. Discuss some of the proposals for aiding the gifted child. (Hint:
break down into specific proposals, such as special classes, accelerated 
promotions, extra assignments, and the like.) 

b. Evaluate three of the modern learning theories. (Hint: This topic 
may be subdivided under descriptive labels—behaviorism, for example—or 
under names of well-known theorists.) 

c. Discuss the causes of the industrial revolution. 

6. Point out any errors in the following items:

1. The Frenchman who developed the first successful intelligence
test was (1) Kuhlmann (2) Terman (3) Binet (4) Wundt

2. An efficient man is one who is (1) strong (2) handsome
(3) angry (4) pusillanimous (5) capable

3. T F Edgar Anderson Poe wrote the poem “The Raven.”

4. T F Lack of emphasis on the three R's is not a serious defect in
modern educational practice.

5. T F ... application of the Golden Rule will make for better
living.

6. T F The median of a distribution of scores is the midpoint,
which is influenced markedly by very high or very low
scores.


The best way to spend one’s leisure time is to (1) read good
books, (2) look at TV, (3) dig in the garden, (4) play solitaire,
(5) relax in an easy chair.

The expansion of the binomial (a + b)² is ...............

I borrowed a book (1. off  2. off of  3. from) my roommate.  (Answer).......

We get the most calories per pound from
(1) candy (4) potatoes
(2) carbohydrates (5) proteins
(3) vitamins

When there is a fire drill, the teacher must make sure that her
... observe ...

... criterion of ... ally be recalled.


CHAPTER 9 




CONSTRUCTING THE OBJECTIVE TEST 


The classroom teacher needs to know how objective mental 
tests are constructed for much the same reason that he needs 
to know what constitutes a good test item (page 184). Stand- 
ardized tests are now employed routinely in many schools. 
Teachers who give and score tests will be better able to interpret 
results and to appraise what an author says about his test when 
they understand how the test items were selected and put to- 
gether. Even more important, perhaps, the teacher who knows 
a few essential procedures will be able to improve greatly the 
quality of the day-to-day tests which he makes for his own use. 

The construction of a comprehensive battery of educational 
achievement tests is not a task appropriate for most teachers and
most schools. Standard printed tests in wide use today are made
by testing bureaus. These agencies have a staff of experts in item 
writing and construction techniques, technically trained assist- 
ants, and access to large and representative samples and to labora- 
tory and scoring equipment. The classroom teacher can hardly 
hope to match all this. And fortunately it isn’t necessary, since 
his test-making is properly on a much more modest scale. 

This chapter will outline the basic techniques in test con- 
struction. These methods apply whether the test is designed to 
measure intelligence, educational achievement, or aptitude. 


WRITING SPECIFICATIONS FOR THE TEST 


Before he begins to construct an examination, the teacher must 
decide what he wants his test to do. This means that he must lay 
down specifications for the test (page 105). Usually a teacher 
wants to test his students’ knowledge of the fundamentals of the 
subject and to see how well they can use this knowledge in
solving problems. Three subject matter tests in different areas
will be described in order to show what specifications the
author had in mind and how he went about accomplishing his
objectives.


Columbia Research Bureau Spanish Test* 


This test is designed for high schools and colleges. Part I calls
for basic knowledge of the language, and Parts II and III require
understanding of language structure and application of rules of
grammar. In more detail, Part I is a vocabulary test of one
hundred words in multiple-choice form. The student is instructed
to mark that one of four or five English words which best
defines the given Spanish word. Part II is a language comprehension
test. There are seventy-five sentences in Spanish arranged in
order of difficulty; each is to be read and marked “True” or
“False.” Part III is concerned with grammar and syntax. This
test consists of one hundred English sentences, each followed
by an incomplete translation in Spanish, which the examinee is
told to complete.

* Published by the World Book Company, Yonkers, N. Y.


California Arithmetic Test (Upper Primary, 
Grades 3 and 4)* 


This test is part of a comprehensive educational achievement
battery, but it may be given as a separate examination. Its
objective is to test for skills in fundamental operations, the
identification of consistently made errors, and the ability to
apply what is known to the solution of problems. The eight
sub-tests cover the four fundamental processes (addition,
subtraction, multiplication, and division), facility and skill in
following directions involving numbers, and simple “mental
arithmetic” problems.


The Nelson-Denny Reading Test** 

The authors state the objectives of this test as follows: to
predict success in college, to enable a sectioning of college and
high-school classes on the basis of reading skills, to aid in the
diagnosis of scholastic difficulties. The examination consists of
two parts, a test of vocabulary and a test of the ability to read
and understand fairly difficult prose. There are one hundred
words in the vocabulary test, each word followed by five choices,
one of which is to be marked as correct. The paragraph-reading
test is made up of nine selections of approximately two hundred
words each. Four questions are asked on each paragraph. There
are five optional answers for each question, one of which is to
be selected by the examinee. It seems clear that the test measures
basic knowledge of language as well as the ability to use this
knowledge intelligently.

* Published by the California Test Bureau, Los Angeles, Calif.
** Published by the Houghton Mifflin Co., Boston, Mass.


SELECTING ITEMS FOR THE TEST

In the construction of an examination, both the content and
the form of the question must be considered.

Deciding on the Type of Item
The teacher must first decide what type of objective item 
he wishes to use. True-False and multiple choice are favorites 
for measuring basic knowledge, and multiple choice, matching, 
completion and essay recall are all used to assess understanding, 
interpretation and application. It is probably less confusing for 
the younger students if a sub-test or section contains only items 
of one type and does not switch from one kind to another. The 
test-maker should start with a much larger number of items 
than he plans to have in the completed test. All the questions 
should be read by other teachers of the subject and criticized for 
form and for content. Items judged to be trivial, inappropriate, 
ambiguous, or too narrow in scope should be revised or
discarded. The items which survive this preliminary inspection
should still number considerably more than the number of items
planned for final use. An excess of items is necessary, since some
items will always be discarded as a result of the item-analysis to
follow.


Arranging the Items in Order of Difficulty 

The questions are now arranged in a rough order of difficulty,
from easy to hard. For the first try-out, the difficulty of an
item as judged by several teachers is sufficient for placement. The
test, as tentatively drawn up, is now administered to a sample of
students for whom the final test is intended (for example,
fifth-grade pupils or high-school freshmen). If several teachers
of the subject co-operate, and thus increase the size of the
experimental group, the final test will be a better examination
than it will be if it is administered to a single small class. It is
always advisable to get as much information as possible on each
item. Hence, those examinees who take the examination in
preliminary form should be urged to attempt every item, even when
they are uncertain of the answer. The time allowance for the
whole test should be generous, so that every student will have
time to try every item. This may make it necessary to have a
second testing period.


Setting the Time Limit


The length of the time interval set for the test when put into 
final form will depend on the time available for testing—most 
often one period of about fifty minutes. Time allowances must 
always take into consideration the age of the pupils, type of 
item (amount of computation or reading needed in answering it), 
whether the test is primarily for survey purposes or for diag- 
nosis, and whether speed and/or power are deemed important. 
In examinations which are strictly power tests, the time limits 
should be long enough for all but the very slowest examinees to 
finish. Sometimes, naturally, an examination has to be cut in 
length in order to have it fit into the available time. 


ITEM ANALYSIS 


The two characteristics of an item which we need to know 
about in building a test are (a) difficulty and (b) validity, or 
discriminative power. These two determinants of an item’s
goodness are computed from the same tabulation of the test data.
Computation of the difficulty and validity of an item is called
item analysis.


Difficulty and Validity in Item Analysis 


The difficulty of an item depends on how many of the examinees
in the tryout group answer it correctly. An item answered
correctly by 90 per cent of the group is obviously easier than one
answered correctly by 50 per cent or by 10 per cent, the last
being a hard item. Very hard and very easy items are ordinarily
less useful than items of intermediate difficulty (page 216). The
validity or discriminative power of an item depends on how well
it distinguishes between the brightest and dullest pupils in the
group. If all of the members of the experimental group answer
an item correctly, or if none does, the item has no validity,
since in neither case does it separate the good from the poor
members of the class.


Biserial r in Item Validity*


The authors of most standard mental tests have used the biserial
r method (or some approximation to it) in determining the
validity of the items in their tests. By means of biserial r, we can
compute the correlation between success and failure on a single
item and size of total score on the test, or on some other measure
of performance taken as the criterion. The size of the correlation
between item and test score shows how well the item is working
together with other items, that is, whether it is a member of a
team. Items unrelated to total score are discarded.

Steps in the determination of item validity by use of biserial r
are as follows:
1. Arrange the test papers in order for total score from highest
to lowest.

2. Count off the highest and lowest 27 per cent** of the papers
(if not exactly, as nearly so as possible). If there are 120 children
in the “standardizing group,” for example, put 32 in the top and
32 in the bottom groups.

3. Count off the number in the high group and the number in
the low group who pass each item, and express these figures as
percentages. Suppose, for example, that Item 18 is passed by 60
per cent of the high group and by 30 per cent of the low group.
Then from tables prepared for the purpose,† we read that the
biserial correlation between this item and the whole test is .31.
For an item passed by 24 per cent of the high group and by only
3 per cent of the low group, the biserial r is .44. In general, any
item with a biserial r of .20 or more can be taken to be valid if
the test is fairly long. In a short test, items of higher validity are
needed. Both hard items and easy items are valid (that is, have
discriminative power) if they separate the high and low groups.
An item passed by 15 per cent of the high group and only 1 per
cent of the low group (a very hard item), for example, has a
biserial r of .47, whereas an item passed by 92 per cent of the
high group and 65 per cent of the low group (an easy item) has
a biserial r of .39. Both are good items, though they differ greatly
in difficulty.

* For the computation of biserial r, see references at the end of
Chapter 2.

** There are good reasons for choosing 27 per cent. When the
distribution of ability is normal, the sharpest discrimination
between extreme groups is obtained when item analysis is based
upon the highest and lowest 27 per cent in each case. When larger
per cents are in the high and low groups, the reliability of the
determination is higher, but the difference between the two groups
decreases. On the other hand, when per cents in the high and low
groups are smaller, reliability falls off but the difference between
the two groups increases.

† See, for example, Item Analysis Table by Chung-Teh Fan, published
by the Educational Testing Service, Princeton, N. J., 1952.


4. Determine the difficulty of each item by averaging the
percentages that pass it in the high and low groups. An item passed
by 60 per cent of the high group and by 30 per cent of the
low group, for example, has a difficulty index of .45, that is,
(.60 + .30)/2, and an item passed by 15 per cent of the high and
1 per cent of the low groups has a difficulty index of .08. This
summary method of obtaining difficulties of items is not as
accurate as is the practice of using the whole group, but it saves
time and is precise enough for most tests.


5. It can be shown mathematically that items with difficulty
indices of .50 or thereabouts are the best items, in the sense of
being able to differentiate among the largest number of good and
... Items, of course, will be found with ... wanted in most school
examinations. ... in selecting items is as follows:

Of items passed by 85-100 per cent (very easy)
    take about 15 per cent
Of items passed by 50-85 per cent (fairly easy)
    take about 35 per cent
Of items passed by 15-50 per cent (fairly hard)
    take about 35 per cent
Of items passed by 0-15 per cent (hard to very hard)
    take about 15 per cent

All of the items should, of course, have satisfactory
discriminative power. Note that different proportions at the
difficulty levels follow the normal distribution.

Items passed by 100 per cent or by nobody have no validity
in either case, but sometimes an author will place several very
easy items at the beginning of the test for psychological effect,
and a few very hard items at the end to test the very bright pupils.


6. In using multiple-choice items, it is important for the im- 
provement of the examination to know to what extent good and 
poor students have chosen the various distractors. If the wrong 
answers are illogical, obviously absurd, or otherwise not very 
misleading, the examinee will have little difficulty selecting the 
right option. The item is easier than it would have been had
the misleads been more attractive. Information concerning the
efficacy of misleads can be obtained by tallying the responses of 
the high and low groups to each mislead, as shown below. The 
group considered is the 120 children referred to in the illustra- 
tion above, and there are 32 (27 per cent) in the high and 32 in 
the low groups. The item is of the multiple-choice type with 
four options, and the correct answer is keyed as (b). 


Item 26        a    (b)    c     d    Omissions   Total
High group     1    16     8     7        0         32
Low group      3     7    10    12        0         32

It is clear that distractor (a) needs to be rewritten, since only 
four in sixty-four chose it. Otherwise, item 26 differentiates be- 
tween the good and poor students rather well. 

A second example shows a slightly different situation. Here (c) 
is keyed as the correct answer. 

Item 10        a     b    (c)    d    Omissions   Total
High group     0    15    11     5        1         32
Low group      5    10     9     8        0         32

Mislead (b) is chosen by more of the good students than is the
correct answer (c); and this is true, too, of the poor students. 
Obviously, mislead (b) must be made less attractive or otherwise 
changed so that it doesn’t compete so strongly with (c). Further- 
more, (c) might be strengthened and (a) examined further to 
see why it failed to attract any answers in the high group. 
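
The tallies for Items 26 and 10 can be screened mechanically, as in the sketch below. The counts are those given in the two tables above. The function name `review_distractors` and the 10 per cent cut-off for calling a mislead "rarely chosen" are assumptions made for illustration; the text itself sets no fixed cut-off.

```python
def review_distractors(high, low, key, min_share=0.10):
    """Flag misleads that draw too few answers or outdraw the keyed answer.

    high, low: dicts mapping each option (and "omit") to the number of
               examinees in that criterion group who chose it.
    key:       the option keyed as correct.
    min_share: assumed cut-off; a mislead chosen by a smaller share of the
               two groups combined is reported as rarely chosen.
    """
    n = sum(high.values()) + sum(low.values())
    notes = {}
    for option in high:
        if option in (key, "omit"):
            continue
        chosen = high[option] + low[option]
        if chosen / n < min_share:
            notes[option] = "rarely chosen"
        elif high[option] >= high[key]:
            notes[option] = "competes with the keyed answer"
    return notes


item_26_high = {"a": 1, "b": 16, "c": 8, "d": 7, "omit": 0}
item_26_low = {"a": 3, "b": 7, "c": 10, "d": 12, "omit": 0}
print(review_distractors(item_26_high, item_26_low, key="b"))
# {'a': 'rarely chosen'}

item_10_high = {"a": 0, "b": 15, "c": 11, "d": 5, "omit": 1}
item_10_low = {"a": 5, "b": 10, "c": 9, "d": 8, "omit": 0}
print(review_distractors(item_10_high, item_10_low, key="c"))
# {'a': 'rarely chosen', 'b': 'competes with the keyed answer'}
```

A flag of this kind only points to the options worth rereading; whether a mislead is rewritten or dropped remains a matter of the teacher's judgment.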


FIGURE 9-1 Item Analysis Data for Test File


FRONT OF CARD

Item 36: What marked change took place in the political status of
India in the year 1947?

1. She received a mandate from the United Nations.
2. She won her independence from Britain.
3. Her people were united under Mohammedan rule.
4. She joined the Arab league.


BACK OF CARD

Item 36:        1     2     3     4    Omissions
High group:    10    32     6     6        0
Low group:     19    11    13     9        2

Sample: 200 high school seniors, tested in June, 1953
Validity: biserial r = .41
Difficulty = 39 per cent


... character of the experimental group on which the data are based, (2)
the validity of the item (its biserial r with the test score), (3) the
difficulty of the item, and (4) data on misleads. Figure 9-1 shows
these data on an item taken from a test in contemporary history.

When a teacher has accumulated a large file of items, tests of
approximately the same range and validity may be made up as
needed.
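
The arithmetic in steps 2, 4, and 5 above can be sketched as follows. The 27 per cent fraction, the averaging rule for difficulty, and the 15-35-35-15 apportionment come from the text; the function names and the 40-item test length in the last call are assumptions made for illustration. The biserial r itself is read from published tables and is not computed here.

```python
def extreme_groups(total_scores, fraction=0.27):
    """Return the top and bottom `fraction` of papers, ranked by total score."""
    ranked = sorted(total_scores, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # as near 27 per cent as possible
    return ranked[:k], ranked[-k:]


def difficulty_index(prop_high, prop_low):
    """Average the proportions passing the item in the high and low groups."""
    return round((prop_high + prop_low) / 2, 2)


def item_quota(n_items):
    """Apportion a test of n_items among the four difficulty levels.

    The rounded counts may differ from n_items by one or two for some
    test lengths and would then be adjusted by hand.
    """
    shares = {
        "very easy (85-100 per cent pass)": 0.15,
        "fairly easy (50-85 per cent pass)": 0.35,
        "fairly hard (15-50 per cent pass)": 0.35,
        "hard to very hard (0-15 per cent pass)": 0.15,
    }
    return {level: round(n_items * share) for level, share in shares.items()}


high, low = extreme_groups(list(range(120)))  # 120 hypothetical total scores
print(len(high), len(low))                    # 32 32
print(difficulty_index(0.60, 0.30))           # 0.45
print(difficulty_index(0.15, 0.01))           # 0.08
print(item_quota(40))
```

For a 40-item test the quota works out to 6, 14, 14, and 6 items at the four levels.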


A SHORT METHOD OF ITEM ANALYSIS 


It is wise for a teacher to understand what the biserial co- 
efficient of correlation means and what it does, since the device 
is utilized in many standard tests and is frequently mentioned in 
the literature of testing. At the same time, it is not necessary for 
the teacher to employ the method in order to construct good 
classroom examinations. The difference between a simple count 
of “rights” in selected fractions of the best and poorest pupils 
will suffice as a measure of the validity or discriminative power 
of an item. First, the items should be gone over by several
teachers, the unsatisfactory items discarded, and the remaining
items arranged in order of difficulty, this determined by the
judgment of the teachers reviewing the items. Next, the test as
tentatively drawn up is administered to a sample of children drawn
from the classes or age levels to be tested. From here on the
steps are as follows:

1. Arrange the test papers in order for size of total score, from
the highest to the lowest.

2. Count off the 25 per cent* of the best papers and the 25 per
cent of the poorest papers. If the total group is small (for
example, under fifty) take some larger proportion, say the upper
half and the lower half. Suppose there are eighty pupils in the
experimental sample (try-out group), so that twenty, or 25 per
cent, fall in the high group and twenty in the low group. Each
item may now be examined to see whether it is able to separate
these two criterion groups.

* Unless the biserial r method is used in determining validities,
there is no need to observe the somewhat unwieldy 27 per cent rule;
25 per cent or any convenient larger percentage will serve.

3. Determine the number in each of the two criterion groups
who answer each item correctly. If fifteen in the high group
answer an item correctly, and five in the low group get the item
right, the validity is 15 - 5, or 10, and the validity index is 10/20
or .50.* If all twenty in the high group answer an item
correctly, and none of the low group gets it right, the validity of
the item is maximal: 20 - 0 = 20, and the validity index is 20/20
or 1.00. The lowest validity index of an item by this method is,
of course, 0/20 or .00. Validity indices run, therefore, from 0 to
1. There may be a few items of negative validity (more rights in
the low than in the high group), but such items are rare. Items
having zero or negative validity must be rewritten before they
are used, or discarded if salvage is impossible.


4. If R_H = number right in the high group and R_L = number
right in the low group, the discriminative power of an item is
simply (R_H - R_L), or (R_H - R_L)/N_H when written as a validity
index. Using the same nomenclature, we may write the difficulty
index of an item as (R_H + R_L)/(N_H + N_L), in which N_H and N_L
are the numbers in the high and low groups, respectively. In our
example above wherein R_H = 15 and R_L = 5, the validity
index is 10/20, or .50, and the difficulty index is
(15 + 5)/(20 + 20), or .50. If R_H = 18 and R_L = 12, the validity
index is 6/20, or .30, and the difficulty index is 30/40, or .75.
Again, if R_H = 10 and R_L = 2, the validity index is 8/20, or
.40, and the difficulty index is 12/40, or .30.


5. Select the items having the highest validity indices for the 
final test. Then follow the table on page 216 in apportioning 


items to the various levels of difficulty, if the test is to cover a 
fairly wide range of talent. 

6. It is advisable to examine the misleads when multiple-choice
items are to be used. The method outlined on page 217 will aid
in locating distractors which are too plausible or not plausible
enough. The first kind are too often accepted, and the second
are taken by only a few.

* Validities can be left simply as the difference between the number
right in the two extreme groups. The chief advantage of a validity
index is to put validities in a percentage scale, as are the
difficulties.

7. A card file of acceptable items will prove useful when a
teacher wants to lengthen a test or to replace non-functioning 
items. When there are a number of items, a parallel form of the 
test can be drawn up. 
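
The two indices of the short method are simple enough to compute by hand, but a brief sketch may help in checking a long list of items. The formulas and the three worked cases are those of steps 3 and 4 above; the function names are assumptions made for illustration.

```python
def validity_index(r_high, r_low, n_high):
    """(R_H - R_L) / N_H: difference in rights, scaled by high-group size."""
    return (r_high - r_low) / n_high


def difficulty_index(r_high, r_low, n_high, n_low):
    """(R_H + R_L) / (N_H + N_L): proportion right in the two groups combined."""
    return (r_high + r_low) / (n_high + n_low)


# The three worked cases from step 4, with 20 papers in each criterion group.
for r_high, r_low in [(15, 5), (18, 12), (10, 2)]:
    v = validity_index(r_high, r_low, 20)
    d = difficulty_index(r_high, r_low, 20, 20)
    print(f"R_H={r_high:2d}  R_L={r_low:2d}  validity={v:.2f}  difficulty={d:.2f}")
```

The loop prints validity and difficulty indices of .50 and .50, .30 and .75, and .40 and .30, agreeing with the figures in the text.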

Table 9-1 shows the sort of data which we can expect to get in 
an item analysis of questions administered to a sample of 80, as 
described above. The full table, of course, would contain data 
on all the items and on all the members of the two criterion 
groups. Half of the scores in the high and in the low groups are 
not shown in order to shorten the table, but these omitted scores 
are included in the totals upon which the item analysis is based. 
Each of the two criterion groups (the high and the low) consists 
of twenty examinees. 


Examination of Table 9-1 shows that Item 4 is highly valid and
that Items 1, 2, and 5 are acceptable. Item 3 has no validity and
must be dropped or changed drastically. An item with a validity 
index of .20 or more may be considered satisfactory—at least 
tentatively. This figure is arbitrary, however. If the test is 
shortened, the acceptable point for a validity index should be 
raised; if the test is lengthened, it should be lowered. Any item 
with an index larger than 0 has some validity and hence some 
value. Note that in Table 9-1 the difficulty indices range from 
.70 (a fairly easy item) to .30 (a fairly hard item). 
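
The screening rule just described can be sketched as below. The differences R_H - R_L for the five items (7, 6, 0, 12, and 4) are read from the bottom rows of Table 9-1, and the .20 cut-off is the tentative figure given in the text; the function name `screen_items` and the wording of the three verdicts are assumptions made for illustration.

```python
def screen_items(differences, n_high, cutoff=0.20):
    """Sort items into accept / weak / rework by validity index (R_H - R_L)/N_H."""
    verdicts = {}
    for item, diff in differences.items():
        index = diff / n_high
        if index >= cutoff:
            verdicts[item] = "acceptable"
        elif index > 0:
            verdicts[item] = "some validity, below cut-off"
        else:
            verdicts[item] = "rewrite or discard"
    return verdicts


# R_H - R_L for Items 1-5, with N_H = 20.
table_9_1 = {1: 7, 2: 6, 3: 0, 4: 12, 5: 4}
print(screen_items(table_9_1, n_high=20))
```

With the cut-off at .20, Items 1, 2, 4, and 5 are accepted and Item 3 is marked for rewriting or discarding. Raising `cutoff` for a shortened test, or lowering it for a lengthened one, follows the adjustment the text recommends.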


SCORING THE COMPLETED TEST 


If the completed test is cast in T-F form, the point scores will
be simply number right, or R - W if we wish to correct for
guessing (page 191). In multiple-choice tests, the correction for
guessing is

Score = R - W/(n - 1)

where n = the number of choices or options. It is sometimes
advisable to use the correction for guessing with T-F items, but
number right without correction is satisfactory in multiple-choice
when four or five options are provided.
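
A short sketch of the correction for guessing follows. The formula is the one given above; the function name and the sample counts of rights and wrongs are hypothetical.

```python
def corrected_score(right, wrong, n_options):
    """Score = R - W/(n - 1); with n = 2 (T-F items) this reduces to R - W."""
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (n_options - 1)


# Hypothetical papers: omitted items count neither as right nor as wrong.
print(corrected_score(right=40, wrong=10, n_options=2))  # T-F test: 30.0
print(corrected_score(right=40, wrong=12, n_options=4))  # four options: 36.0
```

The penalty per wrong answer shrinks as the number of options grows, which is why the uncorrected number right is usually good enough with four or five options.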


TABLE 9-1


Item Analysis of the First Five Items of a Test Made Upon Two
Criterion Groups, the Highest and the Lowest 25 Per Cent
in Total Score. N_H = N_L = 20, and N = 80

(√ = right; 0 = wrong; .. = entry not legible)

                    Total test score            ITEMS
In order of merit   in order of size     1    2    3    4    5

Highest Group (best 25 per cent)
        1                 72             √    √    √    √    √
        2                 70             0    √    0    √    0
        3                 68             √    √    √    √    √
        4                 65             √    0    0    √    0
        5                 65             √    √    √    √    √
        6                 ..             √    √    √    √    √
        7                 63             √    0    0    √    0
        8                 61             √    √    √    √    √
        9                 60             √    √    0    √    0
       10                 60             √    √    √    √    √
       ..
       20                 54             ..   ..   ..   ..   ..
                         R_H =           ..   ..   ..   ..   ..

Lowest Group (poorest 25 per cent)
        1                 35             √    √    √    √    0
        2                 34             ..   ..   ..   ..   ..
        3                 30             √    √    √    √    0
        4                 30             ..   ..   ..   ..   ..
        5                 27             ..   ..   ..   ..   ..
        6                 ..             ..   ..   ..   ..   ..
        7                 25             ..   ..   ..   ..   ..
        8                 24             √    0    √    √    0
        9                 23             √    √    0    0    √
       10                 23             ..   ..   ..   ..   ..
       ..
       20                 12             0    √    √    0    0
                         R_L =           ..   ..   ..   ..   ..

                   R_H - R_L =           7    6    0    12   4
             (R_H - R_L)/N_H =          .35  .30  .00  .60  .20
     (R_H + R_L)/(N_H + N_L) =           ..   ..   ..   ..   ..


In most cases, it is sufficient for the teacher to express standing
on the test in point scores or totals. If several classes have been
tested and it is desirable to compare their performance, percentile
ranks will be useful. Scaling of teacher-made tests in standard
scores or normalized scores is not recommended unless the test
is to be used throughout a school system.
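
Where percentile ranks are wanted for comparing classes, they can be computed as in the sketch below. The definition used (the per cent of scores below a given score, plus half of those equal to it) is one common convention and is an assumption here, as are the function name and the ten sample scores.

```python
def percentile_rank(score, all_scores):
    """Per cent of scores falling below `score`, counting ties as half."""
    below = sum(s < score for s in all_scores)
    equal = sum(s == score for s in all_scores)
    return 100 * (below + 0.5 * equal) / len(all_scores)


# Ten hypothetical point scores pooled from the classes being compared.
pooled = [12, 15, 15, 18, 20, 22, 25, 27, 30, 33]
print(percentile_rank(20, pooled))  # 45.0
print(percentile_rank(15, pooled))  # 20.0
```

The ranks should be computed on the pooled scores of all classes tested, so that a given point score carries the same percentile rank in every class.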

Directions for the final test should be explicit, and time limits 
should be given. Manuals for standardized tests may be consulted 
with profit for pointers on directions and time limits. A test
should not be so long that most students cannot finish in the time 
allowed. 

The use of scoring stencils will speed up marking when many 
papers are to be examined. In T-F tests, a strip containing the 
answers (a key) may be laid alongside the left-hand margin and 
the answers checked as right or wrong, or simply the right 
answers checked. Separate scoring sheets are useful in dealing 
with multiple-choice and matching items. Spaces are numbered 
on the answer sheet for recording answers to the questions on the 
test. The test blank itself is not marked and may be used more
than once.


THE RELIABILITY OF THE COMPLETED TEST 


Perhaps the easiest method of estimating the reliability of a 
teacher-made test, since there is rarely more than one form, is 
by what is called the “split-half” technique. In this procedure, 
the test is administered only once to a sample of examinees, and 
is then divided into two half-tests. The first half-test contains 
the odd-numbered items (1, 3, 5, and so on) and the second
half-test the even-numbered items (2, 4, 6, and so on).* The
correlation between scores on the two half-tests is now found and
from this r the correlation of the whole test with itself (its
self-correlation) is predicted by the well-known “prophecy
formula.”* To illustrate, suppose that in a class of ten seventh
graders, an English Literature test in multiple-choice form has
an odd-even correlation of .50. What is the probable
self-correlation of the whole test? The prophecy formula is

r (whole test) = [2 × r (half-test)] / [1 + r (half-test)]

Substituting r = .50 for the self-correlation of the half-test, we
have that

r (whole test) = (2 × .50)/(1 + .50), or .67

This is a satisfactory reliability coefficient (.67) for a single class.
For standardized tests administered to very large groups of a wide
range of talent, reliability coefficients will ordinarily be higher,
.90 or more. For teacher-made tests, however, the reliability
coefficients will rarely be more than .60 to .70. Reliability is higher
over several grades, that is, when the test is given to more than
one grade. The standard error of a test score can be computed by
the formula given on page 29, but for the teacher-made test
this is often a needless refinement.

Reliability coefficients for a teacher-made test should always
be computed from a new class, never from the sample used in
determining the validities of the test items. Self-correlation in
the standardization group will always be spuriously high, because
the selection of items was based on the scores of the high and
low members of the sample.

* Note that when a test is split into odd and even items, the range of
difficulty in the two half-tests is the same and the split is unique.
Not just any split into two half-tests is satisfactory.

VALIDITY OF THE COMPLETED TEST 


A teacher-made test in physics or French, for example, will
always have content validity, even when the sampling is quite
narrow. Teacher-made tests rarely cover as much material as do
the standard printed tests. An approximate measure of validity
for a test can be found by correlating test scores against school
grades in the same subject. This method is not entirely
satisfactory, since school marks are rarely more dependable measures
of the subject matter than are the tests. When experimental
validity is attempted by correlating scores on a teacher-made test
with grades or with other test scores, a new group must always
be utilized. Such validation, called cross validation, is necessary
because the group used in item analysis is a special group which
has served to select the items in the first place. Cross validation is
necessary also when the two criterion groups (the upper and
lower extreme groups) are selected on the basis of school grades.
A teacher-made test will of necessity correlate with grades
achieved by this group, since the group selected the items.
Perhaps the best way to judge the value of a teacher-made test
is by its predictive validity. If the test aids the teacher in getting
a better notion of the individual differences within the class, and
leads to better understanding of the difficulties of the students
(meager knowledge, wrong knowledge, and so on), it has
fulfilled its purpose.

* The Spearman-Brown Prophecy formula is treated in all standard
texts dealing with statistical method in psychology and education.

SUGGESTIONS FOR FURTHER READING 


Bean, K. L. Construction of Educational and Personnel Tests. New
York: McGraw-Hill, 1953.

Noll, V. H. Introduction to Educational Measurement. Boston:
Houghton Mifflin, 1957.

Ross, C. C., and Stanley, J. C. Measurement in Today's Schools (3rd
edition). New York: Prentice-Hall, 1954.

Travers, R. M. How to Make Achievement Tests. New York: Odyssey
Press, 1950.

SUGGESTIONS FOR LABORATORY WORK 


1. Assume that you have tried out 50 T-F items on a class of 40 pupils.
Draw up a table like that of Table 9-1 showing how you would carry
out an item analysis.

2. If time allows, construct a test using your class as standardizing
sample. Multiple-choice items in arithmetic and vocabulary taken from
E. L. Thorndike's The Measurement of Intelligence may be used
conveniently: Thorndike's book gives items by levels over a wide range of
difficulty. Administer a test of about fifty items and item-analyze the
results by the method given on pages 219-223.

3. Take a test which has been given to this class or to some other class. 
Analyze the questions for validity, following the method on page 222. 


QUESTIONS FOR DISCUSSION 


1. A sixth-grade teacher has administered a test in fundamentals of 
arithmetic. What analyses of the test data could this teacher make which 
would (a) help his future teaching, (b) be of value to individual pupils? 

2. Under what conditions would it be profitable to correct scores on 
a multiple-choice test for guessing? 

3. In some schools, one teacher makes all the examinations in a given 
subject. What are the advantages and disadvantages of this procedure? 

4. What might an item of negative validity mean? Of zero validity? 


CHAPTER 10


SOME PROBLEMS IN THE EVALUATION 
OF TEST SCORES 


Interpreting Multiple Aptitude Test Scores 

Table 10-1 gives the scores achieved by ten ninth graders on 
the Differential Aptitude Tests (DAT). Scores on any mental 
test are more meaningful when supplemented by the pupils’ 
school grades and by a knowledge of personality traits, interests, 
and ambitions. With this proviso in mind, it will be interesting
to answer the questions below with reference only to the
percentile ranks in Table 10-1.

QUESTIONS ON TABLE 10-1 


1. Which two students show the poorest scholastic ability? In
what jobs might they do best?

2. Which student exhibits the most consistently high level of
ability?

3. Which students are likely to have reading difficulties?

4. Which girl should do well in secretarial work?

5. If Joe Kramer wants to go to college, would you encourage
him to plan to go into engineering?

6. Would you encourage Larry Edwards to go into his father's
accountancy firm after graduation from high school?

7. Jane Goodrich plans to become a medical technician. Would
you recommend this vocational goal?

8. Which students will probably find it hard to graduate from
high school?

9. Frank Seay's father is an auto mechanic and Frank is
interested in this work. Do you think it a wise vocational choice?

10. Is it likely that several students are handicapped by poor
spelling and language usage? Why?


Case Studies in Evaluation of Abilities 

The three case studies which follow provide considerable data 
about three pupils, two in high school and one in elementary 
school. Questions are planned to focus upon things to look for in 
evaluating the promise of the pupils being considered. 


I. Case Study of Robert T.

Robert is 16-2, a sophomore in high school. He is well-grown
and makes a good appearance. He is well behaved, quiet,
interested, though not as a participant, in sports, and does not read
much. Robert's father is a house painter; both parents are
high-school graduates. Robert wants to go to college, and is
encouraged to do so by his parents. He wants to be an engineer.


School Data

Ninth Grade                      Tenth Grade (First term)
English              C           English              C
Social Studies       B           Social Studies       D
Mathematics          B           Physics              B
General Science      B           French               [illegible]
Physical Education   C           Physical Education   C

Test Data

Otis Quick Scoring (Form Gamma)               IQ 112
California Mental Maturity (Language)         IQ 110
California Mental Maturity (Non-language)     IQ 121

Cooperative General Achievement Test          Percentile Ranks
  I. Social Studies                           38
  II. Natural Science                         52
  III. Mathematics                            36

Kuder Preference Record (Vocational)          Percentile Ranks
  Mechanical                                  63
  Computational                               51
  Persuasive                                  15
  Artistic                                    12
  Literary                                    46
  Musical                                     51
  Social Service                              20
  Clerical                                    50
  Scientific                                  72

1. What do you think of Robert's chances of succeeding in
college?

2. Robert's interests are in the [text missing] they strong
enough [text missing]

3, 4. [text missing] in Robert's IQ's too great to [text missing]

5. How do you interpret the difference between Robert's
language and non-language IQ's?

6. Would you say that Robert's school grades are not in
keeping with his IQ?

7. Do you think Robert might be more successful as a
technician than as an engineer?

8. Would you recommend that Robert become a salesman?

9. Do Robert's interests jibe with his achievement test records?
With his school marks?

10. Might Robert do well as an airplane pilot?


II. Case Study of William S. 

William is 18-1, a senior in high school. He makes a good
appearance, is husky and muscular. William is easy-going and
affable; he likes to hunt and is interested in, and good at, sports.
His father is a successful lawyer and his mother is a college
graduate interested in club activities. The parents have planned
for William to study medicine: his grandfather was a
well-known physician in the community. William has accepted these
vocational plans but says he is more interested in business and
sales work.


School Data

Tenth Grade                      Eleventh Grade
English              B           English              C
Social Studies       B           Social Studies       B
Mathematics          D           Mathematics          D
Physics              C           Spanish              B
Physical Education   B           Physical Education   B

Test Data

Terman-McNemar Test of Mental Ability         IQ 118

California Achievement Tests (Advanced)       Percentile Ranks
  Reading                                     [illegible]
  Mathematics                                 [illegible]
  Language                                    6

Differential Aptitude Test (Tenth Grade)      Percentile Ranks
  Verbal Reasoning                            [illegible]
  Numerical Ability                           [illegible]
  Abstract Reasoning                          [illegible]
  Space Relations                             [illegible]
  Mechanical Reasoning                        [illegible]
  Clerical Speed and Accuracy                 [illegible]
  Language Usage: Spelling                    75
  Language Usage: Sentences                   [illegible]

Kuder Preference Record (Vocational)          Percentile Ranks
  Outdoor                                     96
  Computational                               32
  Persuasive                                  90
  Artistic                                    [illegible]
  Literary                                    [illegible]
  Musical                                     54
  Social Service                              36
  Clerical                                    40
  Scientific                                  26


1. Do you think that William is college material?

2. Would you encourage him to plan for medicine as a career?

3. Do William's grades verify his DAT scores?

4. Is language a strong area for William? Would you, on the
strength of this, suggest some other vocation than medicine for
William? If so, what?

5. What are William's strong interests, as revealed by the
Kuder Record?

6. Are William's achievement test scores in line with his
school grades?

7. Do you think William might be happier and more
successful in business? Or in the study of law? Give reasons for your
answers.

8. William's IQ does not jibe with his DAT scores. Can you
give any reasons why this should be so?

9. The Kuder scores are [text illegible] more helpful than the DAT in
counseling William. Would [text illegible] And how would you explain the
apparent contradictions?

III. Case Study of Mary S. 

Mary is 11-8, in the second half of the sixth grade. She is
pleasant and well mannered, but is judged by her teachers to be
“nervous” and overanxious. Mary wants to be a teacher. Her
father is an auto salesman, with high-school education; her
mother is a housewife, with junior-college training. There are
three other children in the family.


School Data

Fifth Grade                  Sixth Grade
Reading          C           Reading          B
Social Studies   C           Social Studies   C
Arithmetic       B           Arithmetic       C
Science          C           Science          C
Language         C           Language         B

Test Data

Kuhlmann-Anderson Intelligence Tests      IQ 110

Metropolitan Achievement Tests            Grade Equivalents
  Reading                                 6.1
  Vocabulary                              6.8
  Arithmetic Reasoning                    5.4
  Arithmetic Comprehension                5.2
  English                                 6.2
  Spelling                                6.6
  History                                 5.6
  Science                                 4.6

1. In what subjects is Mary weakest?

2. Would you encourage her to plan for teaching as a career?

3. Is Mary college material? Give reasons for your opinion.

4. Could Mary do office and clerical work successfully?

5. Would it help to have a Stanford-Binet IQ for Mary? Give
reasons for your answer.


Sociometric Testing 


From observations in school and out, most teachers get a fairly
good idea of the social and personal relations within their
classrooms. They soon come to know which children are leaders,
which are well liked and popular, which are disliked or ignored,
and which are picked on and teased. It is sometimes valuable for
a teacher to have, in addition to his own opinion, some measure
of the attitudes and feelings of the pupils regarding each other.
When data of this sort are collected systematically, they may be
put into a table or expressed in the form of a sociogram. This
last is a pictorial or graphic representation of the interpersonal
relations within some specified group, often a class.


The usual procedure is to ask the pupils to designate the
classmate by whom they would rather sit, or the child (or children)
with whom they would prefer to play ball at recess, or to make
some other choice of a companion in a real life situation. Table
10-2 shows the responses made by ten fifth-grade pupils when
asked to nominate their first and second choices of a child to
work with on a class project. (The table reproduces only part
of the data for a class.)

TABLE 10-2
Sociometric Tabulation

[The body of the table, which enters each chooser's first (1) and
second (2) choices under the names of the children chosen, is
illegible. The summary rows that survive are given here.]

CHOSEN         David  Anita  Sally  Gary  Karl  Janet  Jack  Helen  Laura  Ruth
1st Choices      3      2      0     1     1      3     0      0      0     0
2nd Choices      1      0    [remaining entries illegible]

A first choice is shown by a 1 under the name of the child
chosen, a second choice by a 2. An asterisk (*) denotes that the
choice for first place was mutual. Thus, David chose Gary and
was chosen by Gary, and Anita and Janet each named the other
as first choice. The summary at the bottom of the table shows
David to be the most chosen child, with three firsts and one
second. Janet is the next most popular, with three firsts. Sally is
chosen by no one, and three of the girls (Helen, Laura and Ruth)
and one boy (Jack) receive no first choices. Tabulation of the
responses as given by the children will provide the “choice”
information that the teacher wants.

FIGURE 10-1 Sociogram for 21 Kindergartners, 13 Boys and
8 Girls

[Figure not reproduced. The legend distinguishes strong choices,
reciprocals, and partial reciprocals by different arrow symbols.]

From Northway, Mary L., and Weld, Lindsay, Sociometric Testing. Reproduced
by permission of the University of Toronto Press.

A more striking method of representing the social relations
within the group is afforded by the pictorial sociogram shown in
Figure 10-1. The responses were those of twenty-one
Kindergarten children: thirteen boys and eight girls. The stars
(popular, often chosen children) are quickly located as are also the
isolates, whom no one chooses. The two-headed arrows indicate
mutual choices.
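The tabulation of first and second choices, and the search for mutual choices and isolates, can be sketched as follows. The data are hypothetical: only the David-Gary and Anita-Janet mutual first choices follow the text, and the remaining choices, like the variable names, are invented for illustration.

```python
from collections import Counter

# Hypothetical responses: chooser -> (first choice, second choice).
choices = {
    "David": ("Gary", "Anita"),
    "Gary": ("David", "Karl"),
    "Anita": ("Janet", "David"),
    "Janet": ("Anita", "Gary"),
    "Sally": ("Janet", "Anita"),
}

# Count how often each child is named first and second.
first_counts = Counter(first for first, _ in choices.values())
second_counts = Counter(second for _, second in choices.values())

# A first choice is mutual when each child names the other first.
mutual = set()
for chooser, (first, _) in choices.items():
    if first in choices and choices[first][0] == chooser:
        mutual.add(tuple(sorted((chooser, first))))
mutual = sorted(mutual)

# An isolate is a chooser whom no one names, first or second.
isolates = [c for c in choices if first_counts[c] + second_counts[c] == 0]

print(mutual)
print(isolates)
```

With a full class the same counts give the summary rows at the bottom of a table such as Table 10-2, and the mutual pairs correspond to the two-headed arrows of a sociogram.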


When used wisely, a sociometric test can be helpful to a
teacher, especially when the class is too large for close personal
observation. Some of the things which a sociogram may reveal
are the following:

1. Good and bad personal relations, free interchange of
choices, or the existence of cliques.

2. Clusters and cleavages resulting from differences in race,
religion, sex, and economic conditions of families.

3. Differences between in-school and out-of-school social
groupings.

The sociometric method has some disadvantages and may do
more harm than good if the morale of the class is low because of
poor discipline, frequent change of teachers, or other disrupting
influences. For example, choices may be trivial or deliberately
false, or some pupils may take the “test” as an occasion to express
hostility and resentment against other pupils or against the
teacher. Moreover, the choices of young children are often
fleeting, vary from time to time, and are quite unreliable. The
sociogram, therefore, is not foolproof. At the same time, in the hands
of a skillful teacher, sociometric testing will often provide new
insights into the personality traits of pupils and thus aid in
discipline and in remedial work.



APPENDIX A


STATISTICAL SUPPLEMENT 

In order to understand and use test results wisely, a teacher 
should be familiar with those statistical concepts most often em- 
ployed in mental testing. One of the best ways to accomplish 
this is to work through the computation of the basic statistics. In 
Chapter 2 a number of statistical terms were defined and their 
application illustrated. In subsequent chapters these statistics have 
been frequently employed. If, when a statistic is first mentioned, 
the student will work through its derivation—for example, the 
tabulation of a frequency distribution or the computation of 
an r—the value of the statistic to mental testing will be clarified. 
A second or even a third review is often helpful. A good analogy 
here is the habit of looking up unfamiliar words in a dictionary. 
Sometimes a word must be looked up more than once before its 
meaning is clearly grasped. 


This Appendix deals with the following topics: 


The Frequency Distribution 
The Frequency Polygon and the Histogram 
Averages: Mean, Median, and Mode
Measures of Variability: Range, Q, and SD (σ)
The Coefficient of Correlation 


Drawing up a Frequency Distribution 


Test scores are more readily dealt with when they have first
been organized into a frequency distribution. Suppose that Miss
Norton has administered a standard test to her class of forty
pupils in social studies, and that scores are as follows:


37, 38, 36, 31, 28, 33, 24, 19, 25, 34 
16, 43, 22, 20, 26, 44, 27, 19, 25, 34 
33, 24, 22, 20, 44, 27, 31, 28, 38, 17 
31, 26, 34, 17, 19, 20, 22, 24, 26, 29 


Table A-1 shows these forty scores tabulated into a frequency
distribution in which the interval is five score units. Steps in
setting up a frequency distribution follow:


TABLE A-1
Frequency Distribution of Forty Scores on a Social Studies Test

Intervals    Midpoints    Tallies         f
40-44        42           ///             3
35-39        37           ////            4
30-34        32           ///// ///       8
25-29        27           ///// /////    10
20-24        22           ///// ////      9
15-19        17           ///// /         6
                                     N = 40


(1) Determine the range, or the gap between the highest
and lowest scores. Examining our set of forty scores, we find the
range to be from 44 to 16, or 28.

(2) Choose tentatively the size of the interval. [The remainder of
this step is illegible; the surviving fragment concerns ranges
that are very large (200 or 300, say) or very small (less than 25).]

(3) Divide the range by the interval size tentatively chosen.
This gives the approximate (within one) number of intervals.
In Table A-1 the range of 28 divided by five gives 5.6, and the
number of intervals is six. Five is a better choice than is a larger
or smaller unit. For example, an interval of three will spread the
data out too thin (into ten intervals), whereas an interval of ten
crowds all forty scores into three intervals.

(4) Write the beginning and end of each interval as a score: 
for example, 15-19. Actually a score of 15 represents the interval 
from 14.5 to 15.5—that is, a distance along an ability scale; and 
19 represents the interval 18.5 to 19.5. Hence, the lowest interval 
begins at 14.5 and ends at 19.5; the second interval begins at 19.5 
and ends at 24.5, and so on. Writing score limits instead of actual 
limits saves time and avoids the confusion which often arises 
when one interval ends and the next begins with the same figure. 

(5) Tally each score under its proper interval as shown in 
Table A-1. Then write the sum of the tallies opposite each 
interval under f (frequency). Sum the f's to give N. 

Note that the midpoint of the topmost interval is 42, that is,
2.5 from 39.5 and 2.5 from 44.5. The midpoints have been
entered in the second column. When scores have been arranged
into a frequency distribution, all of the f's within a given interval
are represented by the midpoint of that interval.
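The steps above can be sketched in Python using Miss Norton's forty scores. The function name `frequency_distribution` and its parameters are assumptions made for this illustration; the starting point of 15 and the interval of five are those of Table A-1.

```python
scores = [
    37, 38, 36, 31, 28, 33, 24, 19, 25, 34,
    16, 43, 22, 20, 26, 44, 27, 19, 25, 34,
    33, 24, 22, 20, 44, 27, 31, 28, 38, 17,
    31, 26, 34, 17, 19, 20, 22, 24, 26, 29,
]


def frequency_distribution(scores, start, width):
    """Return rows of (low, high, midpoint, f), lowest interval first."""
    rows = []
    low = start
    while low <= max(scores):
        high = low + width - 1                    # score limits, e.g. 15-19
        f = sum(low <= s <= high for s in scores)
        rows.append((low, high, (low + high) / 2, f))
        low += width
    return rows


score_range = max(scores) - min(scores)           # step (1): 44 - 16
table = frequency_distribution(scores, start=15, width=5)

# Print with the highest interval at the top, as in Table A-1.
for low, high, mid, f in reversed(table):
    print(f"{low}-{high}  {mid:4.0f}  {'/' * f:<10}  {f}")
print("N =", sum(f for *_, f in table))
```

Changing `width` to 3 or to 10 shows the effect described in step (3): the scores are either spread over too many intervals or crowded into too few.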


The Frequency Polygon 


Figure A-1 shows the frequency polygon of the forty scores
tabulated into a frequency distribution in Table A-1. Two axes,
a horizontal or X-axis and a vertical or Y-axis, are drawn at right
angles. Score intervals are laid off at regular distances along the
X-axis, or baseline, beginning with 15, the lower limit of the
first interval. The six scores on the lowest interval are represented
by a point six units up on the Y-axis and just above 17, the
midpoint of interval 15-19. The nine scores on the next interval
are represented by a point nine units up on Y and just above 22,
midpoint of the interval 20-24. The other f's are drawn in
in the same manner.

When all of the points are joined by short straight lines, we
have the outline of the frequency polygon. To complete the
figure, that is, to bring it down to the baseline at each end,
two intervals are added, one (10-14) at the low end and the other
(45-49) at the high end. The f on each of these intervals may
be taken as 0, and hence the frequency polygon reaches the
X-axis at 12 and 47.

FIGURE A-1 Frequency Distribution of the Forty Scores in
Table A-1

[Figure not reproduced. X-axis: Scores; Y-axis: Frequencies.]

In order to provide a symmetrical figure, one which is neither
too squat nor too thin, units in X and Y must be carefully
chosen. A good rule is to select units which will make the height
of the figure about 2/3 of its length. In Figure A-1 the maximum
f (10) is about 2/3 the baseline length of the polygon.

The Histogram 


The frequency distribution in Table A-1 is again represented
in Figure A-2, this time by a histogram, or column diagram.
The main difference between [text missing] Figure A-2, for
example, the height of the first rectangle is 6, its width being
the length of the interval 14.5 to 19.5. Each frequency rectangle
begins at the actual lower limit of the interval and ends at the
actual upper limit.

FIGURE A-2 Histogram of the Forty Scores in Table A-1

[Figure not reproduced. X-axis: Scores; Y-axis: Frequencies.]

The histogram presents the same facts as the frequency polygon
and there is often little to choose between them. When two or more
frequency distributions are represented on the same axes, however
(as for example, the scores of two classes or two sections of the
same class), the frequency polygon is to be preferred to the
histogram, because the vertical and horizontal lines in a histogram
coincide and are often difficult to disentangle.


COMPUTATION OF AVERAGES


There are three averages in common use: the mean, the median
and the mode.


The Mean (M) 


We have defined the M on page 20 as the statistic found by
dividing the sum of the scores by their number. When scores
are put into a frequency distribution, the scores classified within
any interval lose their identity and are represented by the
midpoint of that interval. This necessitates a slightly different
procedure from that used with unorganized scores.

In Table A-2, section A, the midpoint of each interval is
multiplied or “weighted” by the frequency which lies opposite
it in the f column. This gives the fX column and the sum of
this column (1100 in Table A-2) divided by N (40) gives a
mean of 27.50. The formula is

$$M = \frac{\Sigma fX}{N}$$

where ΣfX is the sum of the products f × X and N is the
number of cases.


TABLE A-2
Computation of the Mean from a Frequency Distribution
Data are the forty scores in Table A-1.

A. LONG METHOD

Intervals    Midpoints     f      fX
40-44        42            3     126
35-39        37            4     148
30-34        32            8     256
25-29        27           10     270
20-24        22            9     198
15-19        17            6     102
                      N = 40    ΣfX = 1100

M = ΣfX / N = 1100 / 40 = 27.50

B. ASSUMED MEAN METHOD (SHORT METHOD)

Intervals    Midpoints     f     x′     fx′
40-44        42            3      3       9
35-39        37            4      2       8
30-34        32            8      1       8
25-29        27           10      0       0    (sum of positive fx′ = +25)
20-24        22            9     -1      -9
15-19        17            6     -2     -12    (sum of negative fx′ = -21)
                      N = 40          Σfx′ = 4

AM = 27.00      c = 4/40 = .10      i = 5      ci = .50
M = AM + ci = 27.00 + .50 = 27.50

M can always be computed by the “Long Method” just
described, but it is generally computed by the Assumed Mean, or
“Short Method.” When N is large, the Short Method reduces
calculation and saves time. Moreover, the Short Method is
mandatory when standard deviations and coefficients of correlation
are later to be computed from the same data. Computation of M
by the Assumed Mean or Short Method is shown in Table A-2,
Section B. Steps are as follows:

(1) Assume a mean, called the AM, near the center of the 
frequency distribution and if possible on the interval having 
the largest f. In our example, the AM is taken at 27, midpoint of 
interval 25-29, and this interval also has the largest f. 

(2) In the column x′* lay off deviations from the AM of 27
in units of interval. The midpoint of interval 30-34, that is, 32,
deviates five scores or one interval from 27; and the midpoint
of interval 35-39 deviates two intervals from 27, and so on.
Below the AM, the deviations of the midpoints of the two
intervals, 22 and 17, are -1 and -2. The midpoint of the interval
25-29, that is, 27, is the assumed mean, and 0 is entered in
the x′ column opposite this interval.

(3) Multiply each x′ by its f and enter the product in the fx′
column. The sum of this column is 4 (that is, +25 - 21), from
which the correction (c) is calculated. The formula is

$$c = \frac{\Sigma fx'}{N}$$

and c = 4/40 or .10 in our problem.

(4) Multiply c, the correction in units of interval, by the
length of the interval or i, to give ci, the correction in score units.
In our example, ci = .10 × 5 = .50.

(5) Add the correction, ci, to the AM to get M. In Table A-2,
section B, the M = 27.00 + .50, or 27.50, thus checking the
computation in A above.

* x′ denotes the deviation of a midpoint from the AM; that is,
x′ = Mdpt. - AM. Deviations from M are denoted by x.

The Median, or Mdn 


The median is defined as that point in the distribution below
which and above which lie 50 per cent of the distribution. The
median is also described as the fiftieth percentile (P₅₀) and the
second quartile (Q₂). Computation of the median in a
frequency distribution is shown in Table A-3. (The Q, or quartile
deviation, which is found in the same way as the median, is
also computed in the table.)


TABLE A-3
Computation of the Median and Q from a Frequency Distribution
Data are the forty scores in Table A-1.

Intervals     f     Cum. f
40-44         3       40
35-39         4       37
30-34         8       33
25-29        10       25
20-24         9       15
15-19         6        6
         N = 40

N/2 = 20      N/4 = 10      3N/4 = 30

By formula, Median = 24.5 + ((20 - 15) / 10) × 5 = 27.0
By formula, Q₃ = 29.5 + ((30 - 25) / 8) × 5 = 32.63
By formula, Q₁ = 19.5 + ((10 - 6) / 9) × 5 = 21.72
Q = (32.63 - 21.72) / 2 = 5.46

Steps are as follows:

(1) Take 1/2 of N and count into the distribution from the
low end until the interval containing the median is reached. In
Table A-3, N/2 = 20, and counting into the distribution from
interval 15-19, we locate the median on interval 25-29. The
two lowest intervals contain 6 + 9 or 15 f's, and it is clear from
this cumulated f that the twentieth score must fall on interval
25-29.

(2) Apply the following formula:

$$\text{Mdn} = l + \left(\frac{N/2 - \text{cum } f_l}{f_m}\right) \times i$$

in which
l = lower limit of interval on which Mdn lies
N/2 = 1/2 of the number of scores
cum f_l = sum of scores on intervals below l
f_m = frequency on the interval containing the Mdn
i = length of the interval

In our example in Table A-3, l = 24.5, lower limit of interval
containing Mdn; N/2 = 20; cum f_l = 15; f_m = 10; i = 5.
Substituting in the formula, we have

$$\text{Mdn} = 24.5 + \left(\frac{20 - 15}{10}\right) \times 5 = 27.00$$

The median can be found by counting into the distribution from
either end, but it is generally easier to start at the low end.
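The two steps for the median can be sketched in Python as follows. The function name `grouped_median` is an assumption; the lower limits are the actual limits (14.5, 19.5, and so on) and the frequencies are those of Table A-3, listed from the low end.

```python
lower_limits = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5]   # low end first
freqs = [6, 9, 10, 8, 4, 3]


def grouped_median(lower_limits, freqs, width):
    """Median of a frequency distribution by interpolation within an interval."""
    half = sum(freqs) / 2                  # N/2
    cum = 0                                # cumulated f below the interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cum + f >= half:      # the median lies on this interval
            return lower + (half - cum) / f * width
        cum += f
    raise ValueError("empty distribution")


print(grouped_median(lower_limits, freqs, width=5))
```

The loop is step (1), counting in from the low end; the return line is the formula of step (2).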


The Mode 


The mode is usually taken as the midpoint of the interval
which contains the largest f. In Table A-3, the mode is simply 27,
the midpoint of the interval 25-29. This “midpoint” mode is
often called the crude mode. The mode may be calculated more
accurately, but since it is usually a preliminary statistic it is
hardly worth while to do so.


COMPUTATION OF MEASURES OF VARIABILITY 


The means or medians of two distributions are often the same 
or nearly so, but the spread or scatter of the scores around the 
central point is quite different. One class, for example, may 
show the same mean but a much greater range of talent than 
another. Knowing the variability of performance within a class
may be more useful than knowing its average or typical
performance.

There are three measures of variability all of which are used
in mental testing: the range, the Q, and the SD (σ).


The Range 


The Q, or Quartile Deviation 

Q, the quartile deviation, is defined as one-half the distance
between the seventy-fifth and twenty-fifth percentile points in
the distribution. To find these two percentiles, we must count
into the distribution, as in finding the median. In Table A-3,
for example, we count off 3/4 of N to reach Q₃ (the third quartile,
or seventy-fifth percentile) and 1/4 of N to reach Q₁ (the first
quartile, or twenty-fifth percentile). The formula for Q₃ is

$$Q_3 = l + \left(\frac{3N/4 - \text{cum } f_l}{f_m}\right) \times i$$

and the formula for Q₁ is

$$Q_1 = l + \left(\frac{N/4 - \text{cum } f_l}{f_m}\right) \times i$$

in which
l = lower limit of the interval upon which the quartile point falls
i = length of the interval
cum f_l = cumulated f's up to the interval containing the quartile wanted
f_m = f on the interval which contains the quartile

In Table A-3, 3/4 of N is 30. Counting into the distribution
from the low end, twenty-five scores take us to 29.5, lower
limit of 30-34, which is the interval containing Q₃. The f on
this interval is 8. Substituting in the formula, we have

$$Q_3 = 29.5 + \left(\frac{30 - 25}{8}\right) \times 5 = 32.63$$

To obtain Q₁ we count off 1/4 of N or ten scores as shown
in Table A-3. Six scores take us to 19.5, lower limit of the
interval 20-24, the interval which contains Q₁. The f on this
interval is 9. Substituting in the formula, we have

$$Q_1 = 19.5 + \left(\frac{10 - 6}{9}\right) \times 5 = 21.72$$

From the two quartile points, Q₃ and Q₁, we find Q by
substituting in the formula

$$Q = \frac{Q_3 - Q_1}{2}$$

and in our example, Q = (32.63 - 21.72) / 2 or 5.46.
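The same counting procedure serves for both quartile points, as the following Python sketch suggests. The function name `percentile_point` and its `fraction` parameter are assumptions made for illustration; the data are those of Table A-3, listed from the low end.

```python
lower_limits = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5]   # low end first
freqs = [6, 9, 10, 8, 4, 3]


def percentile_point(lower_limits, freqs, width, fraction):
    """Point below which the given fraction of the N cases lies."""
    target = fraction * sum(freqs)         # e.g. 3N/4 for the third quartile
    cum = 0                                # cumulated f below the interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cum + f >= target:
            return lower + (target - cum) / f * width
        cum += f
    raise ValueError("fraction must lie between 0 and 1")


q3 = percentile_point(lower_limits, freqs, width=5, fraction=0.75)
q1 = percentile_point(lower_limits, freqs, width=5, fraction=0.25)
q = (q3 - q1) / 2
print(round(q3, 2), round(q1, 2), round(q, 2))
```

Carried without intermediate rounding, Q comes to about 5.45; the 5.46 of the worked example results from rounding the two quartile points to two decimals before subtracting.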


The Standard Deviation, SD or σ (sigma)


The standard deviation, or σ, is a measure of variability
computed around the mean; hence it is usually calculated from the
same frequency distribution as the mean. SD, or σ, is the most
[text missing] somewhat different procedure. The computation is
shown in Table A-4 for the same forty scores; steps are as follows:


TABLE A-4
Computation of the Standard Deviation (σ) from a
Frequency Distribution
Data are the forty scores in Table A-1.

Intervals     f     x′     fx′     fx′²
40-44         3      3       9      27
35-39         4      2       8      16
30-34         8      1       8       8
25-29        10      0       0       0
20-24         9     -1      -9       9
15-19         6     -2     -12      24
         N = 40         Σfx′ = 4   Σfx′² = 84

AM = 27.00      c = 4/40 = .10      c² = .01      i = 5
σ = 5 × √(84/40 - .01) = 5 × 1.446 = 7.23

[Steps (1) to (3) are missing. The surviving fragment of step (3)
indicates that each fx′ is multiplied by its x′ to give fx′²; in
the top interval, for example, 9 × 3 gives 27 as the fx′².]

(4) Sum the fx′² column to give Σfx′².

(5) Compute the correction (c) as in Table A-2. Square c
to get c². Be sure that c is left in units of interval.

(6) Find σ from the following formula:

$$\sigma = i \sqrt{\frac{\Sigma fx'^2}{N} - c^2}$$

In our example, i = 5, Σfx′² = 84, N = 40 and c² = .01.
Substituting these values in the formula, we get σ = 7.23. It
will be clear that in computing σ we make use of the same
quantities used in finding the mean; only the Σfx′² is new.
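The computation of Table A-4 can be sketched in Python. The function name `sd_short` is an assumption; the midpoints, frequencies, assumed mean, and interval are those used for the mean in Table A-2.

```python
import math

midpoints = [42, 37, 32, 27, 22, 17]
freqs = [3, 4, 8, 10, 9, 6]


def sd_short(midpoints, freqs, assumed_mean, width):
    """Standard deviation from a frequency distribution, Short Method."""
    n = sum(freqs)
    # Deviations of the midpoints from the AM, in units of interval.
    devs = [(x - assumed_mean) / width for x in midpoints]
    c = sum(f * d for d, f in zip(devs, freqs)) / n             # correction
    mean_sq = sum(f * d * d for d, f in zip(devs, freqs)) / n   # sum of fx'^2 over N
    return width * math.sqrt(mean_sq - c * c)


sigma = sd_short(midpoints, freqs, assumed_mean=27, width=5)
print(round(sigma, 2))  # 7.23
```

The correction c is kept in units of interval until the last line, where multiplying by the interval length converts the result to score units.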


CORRELATION 


Correlation (page 27) is the correspondence or relationship 
between two sets of test scores. Degree of relationship is ex- 
pressed by a coefficient of correlation (r) along a scale which 
extends from —1.00 to +1.00 through .00. There are several 
methods of computing correlation, of which the product-moment 
method is the most often employed in dealing with test scores. 
Calculation of a product-moment r is illustrated in Table A-5. 

Table A-5 shows the computation of the correlation between 
test scores in reading and arithmetic achieved by ten children 
in the fifth grade. The sample is much too small to give an ade- 
quate indication of the relationship between these two variables, 
and our table must be taken as a much simplified illustration of 
correlational method. 

The coefficient of correlation in Table A-5 is .23, revealing a 
positive but quite low relationship between the two tests. The 
first test (reading) is designated X, and the second test 
(arithmetic) is Y. Note that, in order to compute the correlation, we 
must first find the deviation of each child's X-score from Mx 
and the deviation of his Y-score from My. Each deviation from 
Mx (53) is entered in the x column, and each deviation from 
My (21) is entered in the y column. Each x and y is then squared 
and entered in the x² and y² columns, and the sums of these two 
TABLE A-5 


Correlation between Reading and Arithmetic in the 
Fifth Grade 


(N = 10) 

Pupils    Reading (X)   Arithmetic (Y)     x      y     x²     y²     xy 
John          60             26             7      5     49     25     35 
Carol         55             24             2      3      4      9      6 
Ann           63             18            10     -3    100      9    -30 
Betty         40             21           -13      0    169      0      0 
Louise        52             17            -1     -4      1     16      4 
Tom           61             20             8     -1     64      1     -8 
Bill          43             15           -10     -6    100     36     60 
Joan          56             25             3      4      9     16     12 
Dick          44             23            -9      2     81      4    -18 
Carl          56             21             3      0      9      0      0 
Sums         530            210                         586    116     61 

Mx = 53.0      My = 21.0      Σx² = 586      Σy² = 116      Σxy = 61 

\[ r = \frac{61}{\sqrt{586 \times 116}} = .23 \]


columns are found. In the last column (xy), the x and y 
deviations of each pupil are multiplied with due regard for sign, and 
the sum of the xy column is determined. Finally, the sum of the 
xy column is divided by the square root of the product of the 
Σx² and Σy² to give the coefficient of correlation. The formula is 

\[ r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \cdot \Sigma y^2}} \]
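The product-moment computation can be sketched in Python as follows. The function name `product_moment_r` is an assumption made for the example; the scores are those of the ten pupils in Table A-5.

```python
import math


def product_moment_r(xs, ys):
    """Product-moment r computed from deviations around the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    x = [value - mean_x for value in xs]   # deviations from Mx
    y = [value - mean_y for value in ys]   # deviations from My
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Reading (X) and arithmetic (Y) scores, in the pupil order of Table A-5
reading = [60, 55, 63, 40, 52, 61, 43, 56, 44, 56]
arithmetic = [26, 24, 18, 21, 17, 20, 15, 25, 23, 21]

r = product_moment_r(reading, arithmetic)
print(round(r, 2))
```

For these scores the sums are Σxy = 61, Σx² = 586 and Σy² = 116, so that r is about .23, as in the table.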


the correlation 
(see references, 




APPENDIX B 


PUBLISHERS OF MENTAL TESTS 


Teachers who do much testing should write the publishers 
below for their catalogs. 

Bureau of Publications, Teachers College, Columbia University, 
New York 27, New York. 

California Test Bureau, 5916 Hollywood Boulevard, Los Angeles 
28, California. 

Educational Test Bureau, 720 Washington Avenue, S.E., Minne- 
apolis 14, Minnesota. 

Educational Testing Service, Cooperative Test Division, 20 Nas- 
sau Street, Princeton, New Jersey. 

Houghton Mifflin Company, 2 Park Street, Boston 7, Massachu- 
setts. 

Psychological Corporation, 304 East 45th Street, New York 17, 
New York. 

Public School Publishing Company, 509-513 North East Street, 
Bloomington, Illinois. 

Science Research Associates, Inc., 57 West Grand Avenue, Chi- 
cago 10, Illinois. 

C. H. Stoelting and Company, 424 North Homan Avenue, Chi- 
cago 24, Illinois. 

Stanford University Press, Stanford, California. 

World Book Company, 313 Park Hill Avenue, Yonkers 5, New 
York. 
GLOSSARY 


achievement test A test designed to measure pupil performance 
in some school subject. 

age equivalent The chronological age assigned to an obtained 
score on a test representing the typical (average) age correspond- 
ing to the score. Example: reading age = 8-4. 

age norm Typical performance on a test expressed in age 
equivalents. 

alternate forms of a test Equivalent or parallel forms of a test. 

aptitude test A test designed to measure potential ability; 
specifically, a test to predict future success in a school subject or in 
a vocation. 

attitude test A test designed to measure likes or dislikes in a 
given area. Example: attitude towards war. 

battery A group of tests, often combined into a team, designed 
to measure a variety of abilities or aptitudes. 

biserial r A coefficient of correlation often used to measure the 
discriminative power of an item in analysis. 

central tendency A measure typical of a group of scores; a mean, 
median or mode. 

chronological age (C.A.) Life age expressed in years and months. 
Thus, 10-4 means 10 years and 4 months. 

completion items Test questions in which the examinee must fill 
in blank spaces in a statement or sentence in order to complete 
the meaning. 

correlation The tendency for one test to be related (or 
unrelated) to another test. 

criterion Any measure of performance with which a test is 
compared in determining validity. 


deviation IQ A standard score found by converting raw scores 
into a distribution with a mean = 100 and a σ of 15 or 16 points. 

diagnostic tests Tests designed to reveal pupils’ strengths and 
weaknesses in school subjects. 

discriminating power A test item which separates good from 
poor students has discriminating power. 
distractor An option in a multiple-choice test that is incorrect. 
essay items Test items calling for a relatively free response. 


evaluation Appraisal of a pupil’s performance; may include in- 
school and out-of-school behaviors. 

frequency distribution An arrangement of test scores into groups 
in order of size. 

grade equivalent The grade score assigned to a given obtained 
score. Example: A score of 42 on an achievement test 
may have a grade equivalent of 6.5 (halfway through the sixth 
grade). 


group test A test that may 


group or class at the same time. 
individual test 


item A single question on a test, 
item analysis The 
validity of test items throu 
matching items Test j 


mean The arithmetic average of a set of test scores. 

median The point that divides a frequency distribution of 
scores into two equal parts. 

mental age (MA) The age for which an obtained score on an 
intelligence test is average or typical. 

mode The score which occurs most often in a distribution. 
multiple-choice items Test items which call for the selection of 
a correct answer from among several options. 

normal probability curve A theoretical distribution curve which 
many distributions of test scores approximate. 

norms Average performances for various groups—expressed as 
age or grade equivalents for school children, as percentiles, and 
in other ways. 

objective test A test answered by checking or circling a number 
or letter. Example: True-false test items. 

options Responses from among which an examinee must make 
a selection. 

percentile rank (PR) The equivalent to an obtained score on a 
scale of 100 points. Example: If a score of 86 has a percentile 
rank (PR) of 63, we know that 63 per cent of the group scored 
below 86. 

personality test A test (often an inventory) designed to assess 
an individual's personal and social behaviors. 

power test A test designed to measure level of performance 
rather than speed. 

profile A graphic device for representing an examinee's scores 
on several tests. 

projective tests Devices for studying personality through the use 
of ink blots, pictures, designs. 

quartile deviation (Q) A measure of variability. Q equals 
one-half of the range of the middle 50 per cent of scores. 

questionnaire A systematic inventory of questions covering per- 
sonality traits, attitudes, or interests. 

readiness test A measure of a child's readiness or maturity level. 
Often used in reading. 




reliability Consistency of test scores. 
reliability coefficient Correlation coefficient giving the self-corre- 
lation of a test. 


skewness The extent to which a distribution of scores is off 
center or biased. 


sociometry Measurement of interpersonal relations within a class 
or other group. 

split-half reliability Reliability coefficient found by splitting a 
test into halves. The two parts of the test usually consist of odd- 
and even-numbered items. 

standard deviation (SD or σ) A measure of variability. 

standard score A converted or derived score found by express- 
ing an obtained score as being so far above or below the mean 
in SD units. 

standardized tests Printed tests for which there are norms on 
defined groups. Directions are carefully prescribed. 

test-retest reliability The correlation between scores made on 
the same test administered on two occasions. 


T-score A normalized score. 

true-false items Test items which the examinee is to mark as 
true or false. 

validity The degree to which a test measures what it purports 
to measure. There are several sorts of validity. 


z-score An obtained score expressed as a deviation from the 
test mean in terms of σ. When z-scores are converted into a 
frequency distribution with an assigned mean and σ, they are 
called standard scores. 
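The z-score and standard-score definitions can be illustrated with a brief Python sketch. The function names and the raw-score figures (a test mean of 50 and a σ of 10) are assumptions made for the example; the assigned mean of 100 and σ of 15 follow the deviation IQ entry above.

```python
def z_score(raw, mean, sd):
    """Deviation of an obtained score from the test mean, in sigma units."""
    return (raw - mean) / sd


def standard_score(z, assigned_mean, assigned_sd):
    """Place a z-score on a scale with an assigned mean and sigma."""
    return assigned_mean + assigned_sd * z


# Hypothetical test: mean 50, sigma 10; a pupil obtains a score of 65
z = z_score(65, 50, 10)
converted = standard_score(z, 100, 15)
print(z, converted)
```

A score one and one-half σ above the mean thus becomes 122.5 on the scale with a mean of 100 and a σ of 15.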
SUBJECT INDEX 


Ability, meaning of, 46 

Achievement tests, 106; definition of, 
102; diagnostic uses of, 116; survey, 
106; teacher-made, 210; value of, 115 

Adjustment inventories, 164 

Age norms, 41 

Age scale, 32; value of, 33 

American Council on Education Psy- 
chological Examination (ACE), 91 

Aptitudes, meaning of, 130 

Aptitude tests, art, 151; batteries of, 
133; case studies of the use of, 226; 
clerical, 137; how to judge, 154; in- 
terpreting scores in, 155; mechanical, 
133; music, 151; use in professional 
schools, 147 

Army Alpha test, 8, 81 

Army Beta test, 7 

Army General Classification Test 
(AGCT), 8, 81 

Art aptitude tests, 151 

Arthur Point Scale, 73; use of, in 
schools, 75 

Ascendance-Submission Reaction Study 
(Allport), 171 

Attitudes, 171; questionnaires in the 
study of, 172 


Bell Adjustment Inventory, 169 

Bennett Mechanical Comprehension 
Test, 135 

Binet, Alfred, 6; characteristics of his 
tests, 6-7 

Biserial r, in item analysis, 215-216 


California Achievement Tests, 111; 
characteristics of, 113 

California Arithmetic Test, 212 

California Test of Mental Maturity, 
84; description of, 85-87 


California Test of Personality, 166-167 


Central tendency, meaning of, 19 

Clerical aptitude tests, 137 

Columbia Research Bureau Spanish 
Test, 211 

Combining test scores, 34-37 

Completion-test items, 202; illustrations 
of, 203-204 

Content analysis, 31, 126, 213 

Cooperative French Test, 123 

Cooperative General Achievement 
Tests, 113-114 

Cooperative Mathematics Test, 122 

Cooperative Science Test, 125 

Correction for guessing, 191; when to 
use, 194 

Correlation, meaning of, 27-28 


Correlation coefficient, computation of, 
251-252 


Criteria, in validity, 154 


Diagnostic tests, clinical, 57-59, 67-68; 
differential, 140-144; educational, 52- 
56, 68, 70-72, 75-77, 93-95 

Diagnostic Tests of Achievement in 
Music, 152-153 


Differential Aptitude Tests (DAT), 
140-144 

Educational achievement tests, 102-103; 
and intelligence tests, 103; compared 
with school examinations, 104-106; 
in school subjects, 118; how 
used in schools, 115-118; what to 
look for in, 125-128 


Educational age (EA), 127 

Essay tests, described, 204-205; how to 
approve, 205-206; scoring in, 206- 


Evaluation and Adjustment Series, 122 


Frequency distribution, 14-15; rules for 
constructing, 239-241 


Frequency polygon, 15; how to con- 
struct, 241-242 


Galton, Francis, role in testing move- 
ment, 5-6 

General Clerical Test, 139 

Gordon's Personal Profile and Personal 
Inventory, 168-169 

Grade norms, 41 

Group tests, of intelligence, 80-82; in 
guidance, 93-95; norms in, 98; relia- 
bility of, 97; scaling in, 97; use in 
schools, 92-96 

Guidance, educational and vocational, 
93-95, 115-117, 229-233 


Halo effect in ratings, 162 
Histogram, 16-17; how to construct, 
242-243 


Individual differences, importance of, 
3, 12 

Intelligence, meaning of, 45; levels of, 
46-47 

Intelligence quotient (IQ), Stanford- 
Binet, 51-52; constancy of, 61-63; dis- 
tribution of, 53-54; precautions in 
interpreting, 60-61; stability of, 56- 
57 


Intelligence quotient (IQ), Wechsler- 
Bellevue, 65-67; in diagnosis, 67-68; 
range of, 67 

Intelligence tests, factors in the choice 
of, 96-100; group, 80-81; individual, 
44-45; performance, 72-75 

Interest inventories, 174-180 

Iowa Silent Reading Tests, 121-122 

IQ (intelligence quotient), 33; as 
standard score, 39; as ratio, 52. See 
Intelligence Quotient 

Item analysis, 214-221; short method 
of, 219-221 

Item (test), difficulty of, 213; selection 
of, 212-213; validity of, 214 


Kuder Preference Record, 177-179 
Kuhlmann-Anderson Intelligence 
Tests, 88-89 


Law School Admission Test, 149 


MacQuarrie Test of Mechanical 
Ability, 134-135 

Matching items, 199; illustrations of, 
200-202 

Mean, 20; in frequency distribution, 
243-246 

Mechanical aptitude tests, 133-137 

Median, 20; in frequency distribution, 
246-247 

Medical College Admission Test, 148- 
149 

Meier Art Judgment Test, 153-154 

Mental age (MA), 32-33 

Mental tests, classification of, 3-5; com- 
pared with physical, 2-3; history of, 
5-12; uses of, in schools, 12 

Metropolitan Achievement Tests, 109- 
111 

Metropolitan Readiness Tests, 118-120 

Minnesota Clerical Test, 137-139 

Minnesota Paper Formboard Test, 144- 
145 

Mode, 20-21, 247-248 

Multiple-choice items, 193-194; illus- 
trations of, 195-197 

Multiple response items, 198-199 

Murphy-Durrell Diagnostic Reading 
Readiness Test, 145-146 

Musical aptitude tests, 151-153 


National Teacher Examination, 150 

Nelson-Denny Reading Test, 212 

Normal distribution, 17; uses of, in 
testing, 17-19 

Normal probability curve, 17; areas 
under, 23 

Norms, 40; age, 33; percentile, 35; 
standard scores as, 36-38 

Objectives, educational, 105-106 

Objective tests, 80, 105; compared with 
essay examinations, 185-189; item 
types in, 185 

Occupational Interest Inventory, 175- 
177 

Orleans Algebra Prognosis Test, 146- 
147 

Otis Quick-Scoring Mental Ability 
Tests, 87-88 

Percentile rank, 25-27; advantages of, 
33-36; limitations of, 36; norms in 
terms of, 35 

Percentile scale, 33-36 

Performance tests, 72-75 

Personality, meaning of, 157-158; 
inventories in the measurement of, 
164; rating scales in the measure- 
ment of, 158-163; sociometric tech- 
niques in, 233-237 

Personality inventories, 164-180; sum- 
mary on the use of, 180-182 

Pintner’s Aspects of Personality, 167- 
168 


Pintner-Cunningham Primary Tests, 
82-84 


Pre-Engineering Ability Tests, 149-150 

Profiles, use of, in comparing test re- 
sults, 35, 85, 143 

Projective tests, meaning of, 158 


Quartile, meaning of, 24-25 

Quartile deviation (Q), calculation of, 
248-249 

Questionnaires, 164 


r, coefficient of correlation, 27-28; 
calculation of, 251-252 

Range of scores, 14, 248 

Rating scales, 158-160; factors affect- 
ing, 160-162; improvement in, 162- 
163; summary on, 163 

Reliability of a test, coefficient of, 27-29; 
parallel forms in, 29; split-half 
technique in, 223-224; test-retest in, 
29 


Seashore Measures of Musical Talent, 
151-152 


Selection of tests, factors in, 96-100, 
125-128, 154-158 

Sequential Tests of Educational 
Progress, 114-115 

Sigma scores, meaning of, 36 

Sociometric techniques, 233-237 

Standard deviation, 22; calculation of, 
in simple series, 22; in a frequency 
distribution, 249-251 





Standard error, of a score, 29-30 
Standard scores, 36; computation of, 
36-38; normalized or T-scores, 40 

Standardized tests, 210-211 

Stanford Achievement Tests, 106-109 

Stanford-Binet Intelligence Scale, 47- 
52; reliability of, 56-57; scoring in, 
51; uses of, in the schools, 52-60; 
validity of, 61-63 

Strong Vocational Interest Blank, 179- 
180 

Study of Values (Allport-Vernon-Lindzey), 
172-173 


Teacher-made tests, 219 ff. 

Terman-McNemar Test of Mental 
Ability, 89-91 

Test items, varieties of, 185 ff. 


Thurstone Temperament Schedule, 
170-171 

True-False items, 189-190; illustrations 
of, 192-193 

T-score, 40 


Turse Shorthand Aptitude Test, 147 
Validity, of a test, 30-31; of test items, 
214 


Variability, in scores, 21 

Verbal ability, and performance abil- 
ity, 64-65, 68, 75-77 

Wechsler-Bellevue Intelligence Scale, 
63-67; in diagnosis, 68; in the schools, 
67-68 

Wechsler Adult Intelligence Scale, 63 

Wechsler Intelligence Scale for 
Children, 68-69; compared with 
Stanford-Binet, 70; MA in, 72; range and 
stability of IQ's in, 71 

z-score, 36 